Incremental (online) machine learning algorithms on ML

2019-08-05 Thread chagas

Hi,

After searching the machine learning library for streaming algorithms, I 
found two that fit the criteria: Streaming Linear Regression 
(https://spark.apache.org/docs/latest/mllib-linear-methods.html#streaming-linear-regression) 
and Streaming K-Means 
(https://spark.apache.org/docs/latest/mllib-clustering.html#streaming-k-means).
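
For context, the decay-weighted per-mini-batch update that Streaming K-Means
applies can be sketched in plain Python. This is a toy, single-machine sketch
of the update rule described in the MLlib docs, not the MLlib implementation
itself:

```python
def update_centers(centers, counts, batch, decay=1.0):
    """One mini-batch update of streaming k-means with a decay factor,
    following the rule in the MLlib docs:
        c' = (c * n * decay + sum_of_assigned_points) / (n * decay + m)
    where n is the center's accumulated weight and m is the number of
    points assigned to it in this batch."""
    k, dim = len(centers), len(centers[0])
    sums = [[0.0] * dim for _ in range(k)]
    assigned = [0] * k
    for x in batch:
        # assign the point to its nearest center (squared Euclidean distance)
        j = min(range(k),
                key=lambda i: sum((a - b) ** 2 for a, b in zip(x, centers[i])))
        assigned[j] += 1
        sums[j] = [s + v for s, v in zip(sums[j], x)]
    new_centers, new_counts = [], []
    for c, n, s, m in zip(centers, counts, sums, assigned):
        w = n * decay + m
        if w == 0:  # center has no accumulated weight and saw no points
            new_centers.append(list(c))
            new_counts.append(0.0)
            continue
        new_centers.append([(ci * n * decay + si) / w for ci, si in zip(c, s)])
        new_counts.append(w)
    return new_centers, new_counts
```

Calling this once per micro-batch with decay < 1 gradually forgets old data,
which is the half-life behaviour the docs describe.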


However, both use the RDD-based API (MLlib) rather than the DataFrame-based 
API (ML); are there any plans to bring them both to ML?


Also, is there any technical reason why there are so few incremental 
algorithms in the machine learning library? There is only one algorithm 
each for regression and clustering, with nothing for classification, 
dimensionality reduction or feature extraction.


If there is a reason, how were those two algorithms implemented? If 
there isn't, what is the general consensus on adding new online machine 
learning algorithms?
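
To make "online" concrete: the per-batch update behind streaming linear
regression is plain stochastic gradient descent, which can be sketched in a
few lines of Python (the idea behind StreamingLinearRegressionWithSGD, minus
Spark's distribution; not the actual MLlib code):

```python
def sgd_update(weights, bias, batch, lr=0.01):
    """One pass of stochastic gradient descent on squared error over a
    mini-batch of (features, label) pairs. Each micro-batch updates the
    current model in place of retraining from scratch."""
    for x, y in batch:
        pred = sum(w * xi for w, xi in zip(weights, x)) + bias
        err = pred - y
        weights = [w - lr * err * xi for w, xi in zip(weights, x)]
        bias -= lr * err
    return weights, bias
```

Repeatedly feeding arriving batches through this update is what makes the
algorithm incremental: the model keeps learning as data streams in.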


Regards,
Lucas Chagas

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Incremental (online) machine learning algorithms on ML

2019-08-05 Thread Stephen Boesch
There are several high bars to getting a new algorithm adopted.

* It needs to be deemed by the MLlib committers/shepherds as widely useful
to the community. Algorithms offered by larger companies, after having
demonstrated usefulness at scale on use cases likely to be encountered by
many other companies, stand a better chance.
* There is quite limited bandwidth for consideration of new algorithms:
there has been a dearth of new ones accepted since early 2015, so
prioritization is a challenge.
* The code must demonstrate high quality standards, especially with respect
to testability, maintainability, computational performance, and scalability.
* The chosen algorithms and options should be well documented and include
comparisons/tradeoffs with the state of the art described in relevant
papers. Questions like "did you consider the approach shown here?" will
typically be asked during design/code reviews.
* There is also luck and timing involved. The review process might start in
a given month, but reviewers become busy or higher priorities intervene,
and it is unclear when reviewing will continue.
* Once the above are complete, there are still intricacies in integrating
with a particular Spark release.

On Mon, Aug 5, 2019 at 05:58, chagas wrote:

> Hi,
>
> After searching the machine learning library for streaming algorithms, I
> found two that fit the criteria: Streaming Linear Regression
> (
> https://spark.apache.org/docs/latest/mllib-linear-methods.html#streaming-linear-regression)
>
> and Streaming K-Means
> (
> https://spark.apache.org/docs/latest/mllib-clustering.html#streaming-k-means
> ).
>
> However, both use the RDD-based API MLlib instead of the DataFrame-based
> API ML; are there any plans for bringing them both to ML?
>
> Also, is there any technical reason why there are so few incremental
> algorithms on the machine learning library? There's only 1 algorithm for
> regression and clustering each, with nothing for classification,
> dimensionality reduction or feature extraction.
>
> If there is a reason, how were those two algorithms implemented? If
> there isn't, what is the general consensus on adding new online machine
> learning algorithms?
>
> Regards,
> Lucas Chagas
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Re: Spark and Oozie

2019-08-05 Thread Dennis Suhari
Hi William,

because it is the only job that is running, I don't think it is resource 
contention. We have configured the capacity scheduler, which means we are 
using YARN queues. As it is the only job, I can't see how it would be 
waiting in the queue.
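
For what it's worth, the standard YARN CLI can show whether the time is spent
waiting in the queue or inside the job itself (a sketch; the queue name and
application id below are placeholders):

```shell
# Is the application stuck in ACCEPTED (i.e. waiting for resources)?
yarn application -list -appStates ACCEPTED,RUNNING

# How full is the queue the job lands in? (recent Hadoop versions)
yarn queue -status default

# Start/launch/finish timestamps for a finished run:
yarn application -status application_1564000000000_0001
```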

Br,

Dennis

Sent from my iPhone

> On Jul 20, 2019 at 01:48, William Shen wrote:
> 
> Dennis, do you know what’s taking the additional time? Is it the Spark job, 
> or Oozie waiting for allocation from YARN? Do you have a resource contention 
> issue in YARN?
> 
>> On Fri, Jul 19, 2019 at 12:24 AM Bartek Dobija  
>> wrote:
>> Hi Dennis, 
>> 
>> Oozie jobs shouldn't take that long in a well-configured cluster. Oozie 
>> allocates its own resources in YARN, which may require fine-tuning. Check 
>> whether YARN gives resources to the Oozie job immediately, which may be one 
>> of the reasons, and adjust job priorities in the YARN scheduling configuration.
>> 
>> Alternatively, check the Apache Airflow project, which is a good alternative 
>> to Oozie. 
>> 
>> Regards,
>> Bartek 
>> 
>>> On Fri, Jul 19, 2019, 09:09 Dennis Suhari  
>>> wrote:
>>> 
>>> Dear experts,
>>> 
>>> I am using Spark for processing data from HDFS (Hadoop). These Spark 
>>> applications are data pipelines, data wrangling and machine learning 
>>> applications, so Spark submits its jobs using YARN. 
>>> This works well. For scheduling I am now trying to use Apache Oozie, 
>>> but I am facing performance impacts: a Spark job which took 44 seconds 
>>> when submitted via the CLI now takes nearly 3 minutes.
>>> 
>>> Have you faced similar experiences in using Oozie for scheduling Spark 
>>> application jobs ? What alternative workflow tools are you using for 
>>> scheduling Spark jobs on Hadoop ?
>>> 
>>> 
>>> Br,
>>> 
>>> Dennis
>>> 
>>> Sent from my iPhone
>>> 
>>> -
>>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org


spark job getting hang

2019-08-05 Thread Amit Sharma
I am running a Spark job; sometimes it runs successfully, but most of the
time I get:

 ERROR Dropping event from queue appStatus. This likely means one of the
listeners is too slow and cannot keep up with the rate at which tasks are
being started by the scheduler.
(org.apache.spark.scheduler.AsyncEventQueue).


Please suggest how to debug this issue.
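
That error means the AsyncEventQueue filled up faster than the registered
listeners could drain it. One common first step is to raise the queue
capacity via a standard Spark config (a sketch; whether it actually helps
depends on which listener is slow, so also check any custom listeners):

```shell
# Default capacity is 10000 events; raising it trades memory for headroom.
spark-submit \
  --conf spark.scheduler.listenerbus.eventqueue.capacity=30000 \
  ...
```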


Thanks
Amit


How to programmatically pause and resume Spark/Kafka structured streaming?

2019-08-05 Thread kant kodali
Hi All,

I am trying to see if there is a way to pause a Spark stream that processes
data from Kafka, so that my application can take some actions while the
stream is paused and resume once those actions are complete.
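
As far as I know, Structured Streaming exposes no pause() API; the usual
workaround is to call query.stop(), do the work, and start the query again
with the same checkpoint location so it resumes where it left off. The
surrounding control logic can be sketched generically in plain Python
(run_stream and its batches are stand-ins, not Spark API):

```python
import threading
import time

def run_stream(batches, resume_event, out):
    """Consume batches, blocking whenever resume_event is cleared."""
    for batch in batches:
        resume_event.wait()   # blocks here while "paused"
        out.append(batch)     # stand-in for the real per-batch processing

resume = threading.Event()    # cleared == paused
out = []
worker = threading.Thread(target=run_stream, args=([1, 2, 3], resume, out))
worker.start()                # starts paused, since resume is not set
time.sleep(0.1)               # the application does its own work here...
resume.set()                  # ...then lets the stream continue
worker.join()
```

With real Spark the wait/continue step would instead be stop() followed by a
fresh writeStream.start() against the same checkpoint directory.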

Thanks!


Re: How to programmatically pause and resume Spark/Kafka structured streaming?

2019-08-05 Thread Gourav Sengupta
Hi,

Exactly my question; I was also looking for ways to gracefully exit Spark
Structured Streaming.


Regards,
Gourav

On Tue, Aug 6, 2019 at 3:43 AM kant kodali  wrote:

> Hi All,
>
> I am trying to see if there is a way to pause a spark stream that process
> data from Kafka such that my application can take some actions while the
> stream is paused and resume when the application completes those actions.
>
> Thanks!
>


Hive external table not working in sparkSQL when subdirectories are present

2019-08-05 Thread Rishikesh Gawade
Hi.
I have built a Hive external table on top of a directory 'A' which has data
stored in ORC format. This directory has several subdirectories inside it,
each of which contains the actual ORC files.
These subdirectories are actually created by spark jobs which ingest data
from other sources and write it into this directory.
I tried creating a table and setting its table properties
*hive.mapred.supports.subdirectories=TRUE* and
*mapred.input.dir.recursive=TRUE*.
As a result, when I fire the simplest query, *select count(*) from
ExtTable*, via the Hive CLI, it successfully gives me the expected count
of records in the table.
However, when I fire the same query via sparkSQL, I get count = 0.

I think sparkSQL isn't able to descend into the subdirectories to read
the data, while Hive is able to do so.
Are there any configurations that need to be set on the Spark side so
that this works as it does via the Hive CLI?
I am using Spark on YARN.
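
In case it helps, two settings commonly affect whether Spark descends into
subdirectories of a Hive ORC table (a sketch of things to try; exact
behaviour varies by Spark version, so this is not a guaranteed fix):

```shell
# 1) Make Spark use the Hive SerDe reader instead of its native ORC reader:
spark-submit --conf spark.sql.hive.convertMetastoreOrc=false ...

# 2) Ask the underlying input format to recurse into subdirectories,
#    e.g. from spark-sql or spark.sql(...):
#    SET mapreduce.input.fileinputformat.input.dir.recursive=true;
```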

Thanks,
Rishikesh
