"I’ve also considered to use Kafka to message between Web UI and the pipes, I think it will fit. Chaining the pipes together as a workflow and implementing, managing and monitoring these long running user tasks with locality as I need them is still causing me headache."
You can look at Apache Samza for chaining the pipes together and managing long-running tasks with locality. The long-running tasks run on YARN, can keep fault-tolerant local state (using LevelDB or RocksDB), and use Kafka for message passing and state replication/persistence.
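For illustration, a rough sketch of what such a task could look like, using Samza's low-level StreamTask API with a changelog-backed key-value store. The store name ("workflow-state"), the output topic ("workflow-progress") and the transition logic are made up here; the input stream and the store would be declared in the job's properties file.

    import org.apache.samza.config.Config
    import org.apache.samza.storage.kv.KeyValueStore
    import org.apache.samza.system.{IncomingMessageEnvelope, OutgoingMessageEnvelope, SystemStream}
    import org.apache.samza.task.{InitableTask, MessageCollector, StreamTask, TaskContext, TaskCoordinator}

    // Consumes per-user workflow events from Kafka, keeps each user's current
    // workflow state in a local (RocksDB/LevelDB-backed, Kafka-replicated)
    // store, and emits progress / next-step messages back to Kafka.
    class ImportWorkflowTask extends StreamTask with InitableTask {
      private var state: KeyValueStore[String, String] = _

      override def init(config: Config, context: TaskContext): Unit = {
        // "workflow-state" must be configured as a key-value store with a changelog.
        state = context.getStore("workflow-state").asInstanceOf[KeyValueStore[String, String]]
      }

      override def process(envelope: IncomingMessageEnvelope,
                           collector: MessageCollector,
                           coordinator: TaskCoordinator): Unit = {
        val userId = envelope.getKey.asInstanceOf[String]
        val event  = envelope.getMessage.asInstanceOf[String] // e.g. "upload-done", "continue", "save"

        val current = Option(state.get(userId)).getOrElse("idle")
        val next    = advance(current, event)                 // hypothetical transition logic
        state.put(userId, next)

        // Notify the Web UI / trigger the next pipe, keyed by user.
        collector.send(new OutgoingMessageEnvelope(
          new SystemStream("kafka", "workflow-progress"), userId, next))
      }

      private def advance(current: String, event: String): String = event match {
        case "upload-done" => "awaiting-review"
        case "continue"    => "post-processing"
        case "save"        => "committing"
        case _             => current
      }
    }

If the input topic is keyed by user id, all events for one user land in the same task instance, which is what gives you the locality and the durable per-user state for these long-running workflows.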
On Tue, Dec 2, 2014 at 3:26 AM, Stadin, Benjamin <benjamin.sta...@heidelberg-mobil.com> wrote:

> To be precise, I want the workflow to be associated with a user, but it doesn't need to be run as part of or depend on a session. I can't run scheduled jobs, because a user can potentially upload hundreds of files which trigger a long-running batch import/update process, but he could also make a very small upload/update and immediately want to continue working on the (temporary) data that he just uploaded. So the duration of that same workflow may vary between a few seconds, a minute and hours, depending entirely on the project's size.

> So a user can log off and back on to the web site, and the initial upload + conversion step may either still be running or be finished. He'll see the progress on the web site, and once the initial processing is done he can continue with the next step of the import workflow, where he can interactively change some stuff on that temporary data. After he is done changing stuff, he can hit a "continue" button which again triggers a long- or short-running post-processing pipe. Then the user can make a final review of the now post-processed data, and after hitting a "save" button a final commit pipe pushes/merges the until-now temporary data to some persistent store.

> You're completely right that I should simplify as much as possible. Finding the right mix seems key. I've also considered using Kafka to message between the Web UI and the pipes; I think it will fit. Chaining the pipes together as a workflow and implementing, managing and monitoring these long-running user tasks with the locality I need is still causing me headaches.

> Btw, the tiling and indexing are not a problem. My problem is mainly in the parallelized conversion, polygon creation and cleaning of CAD file data (e.g. GRASS, prepair, custom tools). After all parts have been preprocessed and gathered in one place, the initial creation of the preview geo file takes a fraction of the time (inserting all data in one transaction takes somewhere between sub-second and < 10 seconds for very large projects). It's currently not a concern.

> (searching for a Kafka+Spark example now)

> Cheers
> Ben

> From: andy petrella <andy.petre...@gmail.com>
> Date: Tuesday, 2 December 2014 10:00
> To: Benjamin Stadin <benjamin.sta...@heidelberg-mobil.com>, "user@spark.apache.org" <user@spark.apache.org>
> Subject: Re: Is Spark the right tool for me?

> Point 4 looks weird to me; I mean, if you intend to have such a workflow run in a single session (maybe consider a sessionless architecture). Is such a process run for each user? If that's the case, maybe finding a way to do it for all users at once would be better (more data but less scheduling).

> For the micro-updates, considering something like a queue (Kestrel? or even Kafk... whatever, something that works) would be great. That way you take the load off the instances, and the updates can be done at their own pace. Also, you can reuse it to notify the WMS.
> Isn't there a way to do the tiling directly? Also, do you need indexes? I mean, do you need the full OGIS power, or are some classical operators enough (using BBox only, for instance)?

> The more you can simplify, the better :-D.

> These are only my 2c; it's hard to think or react appropriately without knowing the whole context.
> BTW, to answer your very first question: yes, it looks like Spark will help you!

> cheers,
> andy

> On Mon Dec 01 2014 at 4:36:44 PM Stadin, Benjamin <benjamin.sta...@heidelberg-mobil.com> wrote:

>> Yes, the processing causes the most stress. But this is parallelizable by splitting the input source. My problem is that once the heavy preprocessing is done, I'm in a "micro-update" mode, so to say (the user-interactive part of the whole workflow). Then the map is rendered directly from the SQLite file by the map server instance on that machine – this is actually a favorable setup for me in terms of resource consumption and implementation costs (I just need to tell the web UI to refresh after something was written to the db, and the map server will render the updates without me changing or coding anything). So my workflow requires breaking out of parallel processing for some time.

>> Do you think my generalized workflow and tool chain can look like this?

>> 1. Pre-process many files in parallel. Gather all results and deploy them on one single machine. => Spark coalesce() + Crunch (for splitting input files into separate tasks)
>> 2. On the machine holding the preprocessed results, configure a map server to connect to the local SQLite source. Do user-interactive micro-updates on that file (the web UI gets updated).
>> 3. Post-process the files in parallel. => Spark + Crunch
>> 4. Design all of the above as a workflow, runnable (or assignable) as part of a user session. => Oozie

>> Do you think this is ok?

>> ~Ben

>> From: andy petrella <andy.petre...@gmail.com>
>> Date: Monday, 1 December 2014 15:48
>> To: Benjamin Stadin <benjamin.sta...@heidelberg-mobil.com>, "user@spark.apache.org" <user@spark.apache.org>
>> Subject: Re: Is Spark the right tool for me?

>> Indeed. However, I guess the important load and stress is in the processing of the 3D data (DEM or alike) into geometries/shades/whatever. Hence you can use Spark (GeoTrellis can be tricky for 3D, poke @lossyrob for more info) to perform these operations and then keep an RDD of only the resulting geometries.
>> Those geometries probably won't be that heavy, hence it might be possible to coalesce(1, true) to have the whole thing on one node (or, if your driver is beefier, do a collect/foreach) to create the index.
>> You could also create a GeoJSON of the geometries and create the r-tree on it (not sure about this one).
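To make step 1 of that workflow concrete: a minimal sketch in plain Spark of running the native conversions in parallel and then gathering the (much smaller) results on one node with coalesce(1, true), as discussed just above. The tool invocation ("convert-tool") and file naming are placeholders standing in for GRASS/prepair/custom tools.

    import org.apache.spark.{SparkConf, SparkContext}
    import scala.sys.process._

    object PreprocessProject {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("cad-preprocess"))

        // Paths of the uploaded files for one project, e.g. fetched from the
        // shared upload location (passed as arguments here as a placeholder).
        val inputs = args.toSeq

        // One partition per file, so each file becomes its own conversion task.
        val converted = sc.parallelize(inputs, inputs.size).map { in =>
          val out = in + ".geom"                          // placeholder output name
          val exitCode = Seq("convert-tool", in, out).!   // stand-in for GRASS/prepair/...
          require(exitCode == 0, s"conversion failed for $in")
          out
        }

        // Gather the results into a single partition so the follow-up step
        // (building the preview GeoPackage in one transaction) runs on one node.
        converted.coalesce(1, shuffle = true).foreachPartition { outputs =>
          outputs.foreach(o => println(s"would import $o")) // build the SQLite/GeoPackage here
        }

        sc.stop()
      }
    }

Whether coalesce(1, true) or a collect() on a beefier driver is the better fit depends on how large the intermediate results are, as noted above.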
>> On Mon Dec 01 2014 at 3:38:00 PM Stadin, Benjamin <benjamin.sta...@heidelberg-mobil.com> wrote:

>>> Thank you for mentioning GeoTrellis, I hadn't heard of it before. We have many custom tools and steps; I'll check whether our tools fit in. The end result is actually a 3D map for native OpenGL-based rendering on iOS / Android [1].

>>> I'm using GeoPackage, which is basically SQLite with R-Tree and a small library around it (more lightweight than SpatiaLite). I want to avoid accessing the SQLite db from any other machine or task; that's where I thought I could use a long-running task which is the only process responsible for updating a locally stored SQLite db file. As you also said, SQLite (or almost any other file-based db) won't work well over a network. This isn't limited to R-Tree but is an expected limitation due to file locking issues, as also documented by SQLite.

>>> I also thought of doing the same thing when rendering the (web) maps. In combination with the db handler which does the actual changes, I thought of running a map server instance on each node and configuring it to add the database location as a map source once the task starts.

>>> Cheers
>>> Ben

>>> [1] http://www.deep-map.com

>>> From: andy petrella <andy.petre...@gmail.com>
>>> Date: Monday, 1 December 2014 15:07
>>> To: Benjamin Stadin <benjamin.sta...@heidelberg-mobil.com>, "user@spark.apache.org" <user@spark.apache.org>
>>> Subject: Re: Is Spark the right tool for me?

>>> Not quite sure which geo processing you're doing: is it raster or vector? More info would be appreciated so I can help you further.

>>> Meanwhile I can try to give some hints. For instance, have you considered GeoMesa <http://www.geomesa.org/2014/08/05/spark/>? Since you need a WMS (or alike), have you considered GeoTrellis <http://geotrellis.io/> (go to the batch processing)?

>>> When you say SQLite, do you mean that you're using SpatiaLite? Or is your db not a geo one, just plain SQLite? In case you need an r-tree (or related) index, your headaches will come from congestion within your database transactions... unless you go to a dedicated database like Vertica (just mentioning it).

>>> kr,
>>> andy
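On that single-writer idea (and the transaction congestion concern): a minimal sketch of a db handler that is the only process touching the local SQLite/GeoPackage file and applies each micro-update in its own short transaction. This assumes the xerial sqlite-jdbc driver on the classpath; the table, columns and update shape are made up.

    import java.sql.DriverManager

    // One long-running handler per project owns the local SQLite/GeoPackage
    // file, so there is exactly one writer and no file locking over the network.
    class LocalDbHandler(dbPath: String) {
      private val conn = DriverManager.getConnection(s"jdbc:sqlite:$dbPath")
      conn.setAutoCommit(false)

      // Apply one micro-update coming off the queue (Kafka/Kestrel/...).
      def applyUpdate(featureId: Long, wkbGeometry: Array[Byte]): Unit = {
        val stmt = conn.prepareStatement("UPDATE features SET geom = ? WHERE id = ?") // made-up schema
        try {
          stmt.setBytes(1, wkbGeometry)
          stmt.setLong(2, featureId)
          stmt.executeUpdate()
          conn.commit()                    // one short transaction per update
          // Here you could also notify the co-located map server / web UI to refresh.
        } catch {
          case e: Exception => conn.rollback(); throw e
        } finally {
          stmt.close()
        }
      }

      def close(): Unit = conn.close()
    }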
>>> On Mon Dec 01 2014 at 2:49:44 PM Stadin, Benjamin <benjamin.sta...@heidelberg-mobil.com> wrote:

>>>> Hi all,

>>>> I need some advice on whether Spark is the right tool for my zoo. My requirements share commonalities with "big data", workflow coordination and "reactive" event-driven data processing (as in, for example, Haskell Arrows), which doesn't make it any easier to decide on a tool set.

>>>> NB: I have asked a similar question on the Storm mailing list, but have been referred to Spark. I previously thought Storm was closer to my needs – but maybe neither is.

>>>> To explain my needs, it's probably best to give an example scenario:

>>>> - A user uploads small files (typically 1-200 files, file size typically 2-10 MB per file).
>>>> - Files should be converted in parallel on the available nodes. The conversion is actually done via native tools, so there is not much big data processing required, but rather dynamic parallelization (for example, splitting the conversion step into as many conversion tasks as there are files). The conversion typically takes between several minutes and a few hours.
>>>> - The converted files are gathered and stored in a single database (containing geometries for rendering).
>>>> - Once the db is ready, a web map server is (re-)configured and the user can make small updates to the data set via a web UI.
>>>> - … Some other data processing steps which I leave out for brevity …
>>>> - Initially there will be only a few concurrent users, but the system shall be able to scale if needed.

>>>> My current thoughts:

>>>> - I should avoid uploading files into the distributed storage during conversion; instead, each conversion filter should probably download the file it is actually converting from a shared place. Otherwise it's bad for scalability (too many redundant copies of the same temporary files if there are many concurrent users and many cluster nodes).
>>>> - Apache Oozie seems to be an option for chaining my pipes together into a workflow. But is it a good fit with Spark? What options do I have with Spark to chain a workflow from pipes?
>>>> - Apache Crunch seems to make it easy to dynamically parallelize tasks (Oozie itself can't do this). But I may not need Crunch after all if I have Spark, and it also doesn't seem to fit my last problem below.
>>>> - The part that causes me the most headache is the user-interactive db update: I'm considering using Kafka as a message bus to broker between the web UI and a custom db handler (NB: the db is a SQLite file). But what about update responsiveness: won't Spark cause some lag (as opposed to Storm)?
>>>> - The db handler probably has to be implemented as a long-running, continuing task, so that when a user sends some changes, the handler writes them to the db file. However, I want this to be decoupled from the job. So these updates should be done only locally, on the machine that started the job, for the whole lifetime of this user interaction. Does Spark allow creating such long-running tasks dynamically, so that when another (web) user starts a new task, a new long-running task is created and run on the same node, which eventually ends and triggers the next task? Also, is it possible to identify a running task, so that a long-running task can be bound to a session (the db handler working on local db updates until the task is done) and eventually restarted/recreated on failure?

>>>> ~Ben
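And for the "Kafka as message bus between the web UI and the db handler" thought: a minimal sketch of the producing (web UI) side using the Kafka Java producer client. The topic name, key and JSON payload are made up; a handler like the one sketched above would sit on the consuming side and apply the updates to the local SQLite file.

    import java.util.Properties
    import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

    object UiUpdatePublisher {
      def main(args: Array[String]): Unit = {
        val props = new Properties()
        props.put("bootstrap.servers", "broker1:9092")
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")

        val producer = new KafkaProducer[String, String](props)
        try {
          // Key by user/project id so all updates for one project stay in one
          // partition, arrive in order and reach the same db handler instance.
          val record = new ProducerRecord[String, String](
            "db-updates", "project-42", """{"featureId": 7, "op": "move", "dx": 1.5}""")
          producer.send(record).get()   // block until acknowledged (simplest form)
        } finally {
          producer.close()
        }
      }
    }

The point of the queue, as suggested earlier in the thread, is that the interactive micro-updates flow through it to the single writer at their own pace, decoupled from the latency of the heavy pre- and post-processing jobs.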