Incremental (online) machine learning algorithms on ML
Hi,

After searching the machine learning library for streaming algorithms, I found two that fit the criteria: Streaming Linear Regression (https://spark.apache.org/docs/latest/mllib-linear-methods.html#streaming-linear-regression) and Streaming K-Means (https://spark.apache.org/docs/latest/mllib-clustering.html#streaming-k-means).

However, both use the RDD-based API (MLlib) instead of the DataFrame-based API (ML); are there any plans to bring them both to ML?

Also, is there any technical reason why there are so few incremental algorithms in the machine learning library? There is only one algorithm each for regression and clustering, with nothing for classification, dimensionality reduction, or feature extraction. If there is a reason, how were those two algorithms implemented? If there isn't, what is the general consensus on adding new online machine learning algorithms?

Regards,
Lucas Chagas

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org
Re: Incremental (online) machine learning algorithms on ML
There are several high bars to getting a new algorithm adopted:

* It needs to be deemed by the MLlib committers/shepherds as widely useful to the community. Algorithms offered by larger companies, after having demonstrated usefulness at scale for use cases likely to be encountered by many other companies, stand a better chance.
* There is quite limited bandwidth for consideration of new algorithms: there has been a dearth of new ones accepted since early 2015, so prioritization is a challenge.
* The code must demonstrate high quality standards, especially with respect to testability, maintainability, computational performance, and scalability.
* The chosen algorithms and options should be well documented and include comparisons/tradeoffs with the state of the art described in relevant papers. These questions will typically be asked during design/code reviews, e.g. "did you consider the approach shown *here*?"
* There is also luck and timing involved. The review process might start in a given month, but reviewers become busy or higher priorities intervene, and then it is unclear when the reviewing will continue.
* Once the above are complete, there are still intricacies in integrating with a particular Spark release.

On Mon, 5 Aug 2019 at 05:58, chagas wrote:
> However, both use the RDD-based API MLlib instead of the DataFrame-based API ML; are there any plans for bringing them both to ML?
>
> Also, is there any technical reason why there are so few incremental algorithms on the machine learning library?
Re: Spark and Oozie
Hi William,

Because it is the only job that is running, I don't think it is resource contention. We have configured the capacity scheduler, which means using YARN queues. As it is the only job, I can't see that it is waiting somehow in the queue.

Br,
Dennis

Sent from my iPhone

> On 20.07.2019 at 01:48, William Shen wrote:
>
> Dennis, do you know what's taking the additional time? Is it the Spark job, or Oozie waiting for allocation from YARN? Do you have a resource contention issue in YARN?
>
>> On Fri, Jul 19, 2019 at 12:24 AM Bartek Dobija wrote:
>>
>> Hi Dennis,
>>
>> Oozie jobs shouldn't take that long in a well-configured cluster. Oozie allocates its own resources in YARN, which may require fine tuning. Check whether YARN gives resources to the Oozie job immediately, which may be one of the reasons, and change job priorities in the YARN scheduling configuration.
>>
>> Alternatively, check the Apache Airflow project, which is a good alternative to Oozie.
>>
>> Regards,
>> Bartek
>>
>>> On Fri, Jul 19, 2019, 09:09 Dennis Suhari wrote:
>>>
>>> Dear experts,
>>>
>>> I am using Spark for processing data from HDFS (Hadoop). These Spark applications are data pipelines, data wrangling, and machine learning applications, so Spark submits its jobs using YARN. This works well. For scheduling I am now trying to use Apache Oozie, but I am facing performance impacts. A Spark job which took 44 seconds when submitted via the CLI now takes nearly 3 minutes.
>>>
>>> Have you had similar experiences using Oozie for scheduling Spark application jobs? What alternative workflow tools are you using for scheduling Spark jobs on Hadoop?
>>>
>>> Br,
>>>
>>> Dennis
>>>
>>> Sent from my iPhone
Spark job hanging
I am running a Spark job; sometimes it runs successfully, but most of the time I get:

ERROR Dropping event from queue appStatus. This likely means one of the listeners is too slow and cannot keep up with the rate at which tasks are being started by the scheduler. (org.apache.spark.scheduler.AsyncEventQueue)

Please suggest how to debug this issue.

Thanks,
Amit
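This message means the internal listener-bus queue filled up faster than the registered listeners could drain it, so Spark dropped events rather than block the scheduler. One mitigation commonly tried (a sketch, assuming Spark 2.3+, where this per-queue setting exists; the capacity value and jar name below are just placeholders) is raising the queue capacity from its default of 10000:

```shell
# Sketch only: enlarge the listener bus event queue so a slow listener
# (e.g. the app-status listener feeding the UI) drops fewer events.
spark-submit \
  --conf spark.scheduler.listenerbus.eventqueue.capacity=20000 \
  your-job.jar
```

A larger queue only buys headroom; if a custom or misbehaving listener is the real bottleneck, it still has to be identified and fixed.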
How to programmatically pause and resume Spark/Kafka structured streaming?
Hi All,

I am trying to see if there is a way to pause a Spark stream that processes data from Kafka, such that my application can take some actions while the stream is paused and resume when the application completes those actions.

Thanks!
Re: How to programmatically pause and resume Spark/Kafka structured streaming?
Hi,

Exactly my question; I was also looking for ways to gracefully exit Spark structured streaming.

Regards,
Gourav

On Tue, Aug 6, 2019 at 3:43 AM kant kodali wrote:
> Hi All,
>
> I am trying to see if there is a way to pause a spark stream that process data from Kafka such that my application can take some actions while the stream is paused and resume when the application completes those actions.
>
> Thanks!
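Structured Streaming does not expose a pause()/resume() call on a running query; the usual workaround is to stop the query and later restart it from the same checkpoint location. The control flow behind that workaround can be sketched without Spark at all — the class and names below are hypothetical plain Python, only illustrating the pattern of checking a pause flag between micro-batches:

```python
import threading
from queue import Queue, Empty

class PausableConsumer:
    """Toy stand-in for a micro-batch loop: between batches it checks a
    pause flag, mirroring a stop-query / restart-from-checkpoint cycle."""

    def __init__(self, source: Queue):
        self.source = source
        self._running = threading.Event()
        self._running.set()          # start unpaused
        self.processed = []

    def pause(self):
        self._running.clear()        # next batch will block before reading

    def resume(self):
        self._running.set()          # unblock the batch loop

    def run_batches(self, max_batches: int):
        for _ in range(max_batches):
            self._running.wait()     # block here while paused
            try:
                record = self.source.get_nowait()
            except Empty:
                break                # no more input: stop cleanly
            self.processed.append(record * 2)   # "process" the record

q = Queue()
for i in range(3):
    q.put(i)
consumer = PausableConsumer(q)
consumer.run_batches(max_batches=3)
print(consumer.processed)   # [0, 2, 4]
```

In real Spark terms, the equivalent of `pause()` is `query.stop()` plus a later restart with the same checkpoint directory, so processing resumes where it left off.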
Hive external table not working in sparkSQL when subdirectories are present
Hi,

I have built a Hive external table on top of a directory 'A' which has data stored in ORC format. This directory has several subdirectories inside it, each of which contains the actual ORC files. These subdirectories are created by Spark jobs which ingest data from other sources and write it into this directory.

I tried creating a table and setting its table properties *hive.mapred.supports.subdirectories=TRUE* and *mapred.input.dir.recursive=TRUE*. As a result, when I fire the simplest query, *select count(*) from ExtTable*, via the Hive CLI, it successfully gives me the expected count of records in the table.

However, when I fire the same query via Spark SQL, I get count = 0. I think Spark SQL isn't able to descend into the subdirectories to get the data, while Hive is able to do so. Are there any configurations that need to be set on the Spark side so that this works as it does via the Hive CLI? I am using Spark on YARN.

Thanks,
Rishikesh
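By default, Spark's file-based read path does not recurse into subdirectories even when the Hive table properties allow it. One configuration commonly tried for this symptom (a sketch, not a verified fix for this exact setup; the jar name is a placeholder) is forwarding the recursive-input settings into Spark's Hadoop configuration so the table-scan path can see the nested files:

```shell
# Sketch: pass the recursive-input settings through to Spark's Hadoop conf.
spark-submit \
  --conf spark.hadoop.mapreduce.input.fileinputformat.input.dir.recursive=true \
  --conf spark.hadoop.hive.mapred.supports.subdirectories=true \
  your-job.jar
```

As an alternative, on Spark 3.0+ the DataFrameReader supports a `recursiveFileLookup` option when reading the ORC directory directly, bypassing the Hive table altogether.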