Re: [External] Re: Versioning of Spark Operator

2024-04-11 Thread Ofir Manor
A related question - what is the expected release cadence? At least for the 
next 12-18 months?
Since this is a new subproject, I am personally hoping it would have a faster 
cadence at first, maybe once a month or once every couple of months... If so, 
that would affect versioning.
Also, if it uses semantic versioning: since it is early days for the subproject, it 
might have a few releases with breaking changes until its own API, defaults, and 
behavior become stable, so again, having its own versioning might help.
Just my two cents,
   Ofir

From: L. C. Hsieh 
Sent: Wednesday, April 10, 2024 6:14 PM
To: Dongjoon Hyun 
Cc: dev@spark.apache.org 
Subject: [External] Re: Versioning of Spark Operator

This approach makes sense to me.

If the Spark K8s operator were aligned with Spark versions, it would, for
example, use 4.0.0 now. Because these JIRA tickets are not actually targeting
Spark 4.0.0, that would cause confusion and more questions, like whether we
should include Spark operator JIRAs in the release notes when we cut a Spark
release, etc.

So I think an independent version number for Spark K8s operator would
be a better option.

If there are no more options or comments, I will create a vote later
to create new "Versions" in Apache Spark JIRA.

Thank you all.

On Wed, Apr 10, 2024 at 12:20 AM Dongjoon Hyun  wrote:
>
> Ya, that would work.
>
> Inevitably, I looked at Apache Flink K8s Operator's JIRA and GitHub repo.
>
> It looks reasonable to me.
>
> Although they share the same JIRA, they chose different patterns in different places.
>
> 1. In POM file and Maven Artifact, independent version number.
> 1.8.0
>
> 2. Tag is also based on the independent version number
> https://github.com/apache/flink-kubernetes-operator/tags
> - release-1.8.0
> - release-1.7.0
>
> 3. The JIRA Fix Version uses a `kubernetes-operator-` prefix.
> https://issues.apache.org/jira/browse/FLINK-34957
> > Fix Version/s: kubernetes-operator-1.9.0
>
> Maybe, we can borrow this pattern.
>
> I guess we need a vote for any further decision because we need to create new 
> `Versions` in Apache Spark JIRA.
>
> Dongjoon.
>

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: Decimals

2017-12-25 Thread Ofir Manor
Hi Marco,
great work, I personally hope it gets included soon!
I just wanted to clarify one thing - Oracle and PostgreSQL do not have
infinite precision. The scale and precision of decimals are just
user-defined (explicitly or implicitly).
So, both of them follow the exact same rules you mentioned (like every
other database).
Specifically, both round on loss of precision and both throw an exception
on overflow.
Here is an example:

*PostgreSQL*
postgres=# create table test(i decimal(3,2));
CREATE TABLE
postgres=# insert into test select 1.5;
INSERT 0 1
postgres=# select * from test;
  i
--
 1.60
(1 row)
postgres=# insert into test select -654.123;
ERROR:  numeric field overflow
DETAIL:  A field with precision 3, scale 2 must round to an absolute value
less than 10^1.

*Oracle*
SQL> create table test(i number(3,2));

Table created.

SQL> insert into test select 1.5 from dual;

1 row created.

SQL> select * from test;

 I
--
   1.6

SQL> insert into test select -654.123 from dual;
insert into test select -654.123 from dual
*
ERROR at line 1:
ORA-01438: value larger than specified precision allowed for this column
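For comparison, a rough spark-shell check of the same two casts (using the `spark` session that spark-shell predefines; per the summary later in this thread, Spark today returns NULL instead of rounding or raising an error, and the exact result depends on the Spark version and on the PR under discussion):

spark.sql("SELECT CAST(1.5 AS DECIMAL(3,2)) AS v").show()      // fits: 1.50
spark.sql("SELECT CAST(-654.123 AS DECIMAL(3,2)) AS v").show() // overflow: NULL in Spark today, an error in PostgreSQL/Oracle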

I hope that helps, it strengthens your point!

Ofir Manor

Co-Founder & CTO | Equalum

Mobile: +972-54-7801286 | Email: ofir.ma...@equalum.io

On Fri, Dec 22, 2017 at 1:11 PM, Marco Gaido <marcogaid...@gmail.com> wrote:

> Hi Xiao, hi all,
>
> I checked the DB2 documentation for which you provided me a link in the PR
> (thanks for it), and what you are stating is not really right.
> DB2, in compliance with the SQL standards, throws an exception if an
> overflow occurs, i.e. if a loss of significant digits is necessary to
> properly represent the value, which is the case I discussed as point 3 of
> the previous e-mail. Since DB2 has a maximum precision of 31, while the
> other DBs have a higher precision (in particular SQLServer and Hive, like
> Spark, have 38 as the maximum precision), the same operation running fine on
> Hive or SQLServer (or Spark after my PR) may throw an exception on DB2 (but
> this is due to overflow, i.e. point 3, not to loss of precision).
>
> In the case of loss of precision, in compliance with SQL standards, DB2
> performs truncation (emitting just a warning). And it can truncate more
> than us, since we ensure that at least 6 digits are kept for the scale,
> while DB2 has a minimum of 0 for the scale (refer to
> https://www.ibm.com/support/knowledgecenter/SSEPEK_10.0.0/sqlref/src/tpc/db2z_decimalmultiplication.html).
> I'm citing the relevant part here for convenience:
>
> The truncated copy has a scale of MAX(0,S-(P-15)), where P and S are the
>> original precision and scale. If, in the process of truncation, one or more
>> nonzero digits are removed, SQLWARN7 in SQLCA is set to W, indicating loss
>> of precision.
>>
>
> Moreover, the rules applied by DB2 to determine precision and scale are
> analogous to the ones used by Hive and SQLServer. The only real practical
> difference is that we enforce keeping at least 6 as the value for scale,
> while DB2 has 0 as the minimum (which is even worse than our approach according
> to your previous e-mail about the financial use case).
>
> For the brave ones who went on reading so far, I'll summarize the
> situation of the 3 point above, adding DB2 to the comparison:
>
>  1. *Rules to determine precision and scale*
>  - *Hive, SQLServer (and Spark after the PR)*: I won't include the
> exact formulas; the relevant part is that in case of a precision
> higher than the maximum value, we use the maximum available value (38) as
> the precision and, as the scale, the maximum between the needed scale
> (computed according to the relevant formula) and a minimum guaranteed
> value which is 6 (see the sketch after this summary).
>  - *DB2*: practically the same rules as above. The main differences are: the
> maximum precision is 31 and it doesn't enforce any minimum value for the
> scale (or the minimum value guaranteed for the scale is 0).
>  - *Postgres and Oracle*: NA, they have nearly infinite precision...
>  - *SQL ANSI 2011*: no indication
>  - *Spark now*: if the precision needed is more than 38, use 38 as
> precision; use the needed scale without any adjustment.
>
>   2. *Behavior in case of precision loss but result in the range of the
> representable values*
>  - *Hive, SQLServer (and Spark after the PR)*: round the result.
>  - *DB2*: truncates the result (and emits a warning).
>  - *Postgres and Oracle*: NA, they have nearly infinite precision...
>  - *SQL ANSI 2011*: either truncate or round the value.
>  - *Spark now*: returns NULL.
>
>   3. *Behavior in case of result out of the range of the representable
> values (i.e overflow)*
>  - *DB2, **SQLServer*: throw a
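To make point 1 above concrete, here is a small Scala sketch (illustrative only; the constants and helper names are not Spark's actual code) of the Hive/SQLServer-style adjustment described there, next to the DB2 truncation formula quoted earlier:

// Hive/SQLServer (and Spark after the PR): cap precision at 38 and guarantee
// at least min(neededScale, 6) digits of scale.
val MAX_PRECISION = 38
val MIN_ADJUSTED_SCALE = 6

def adjust(neededPrecision: Int, neededScale: Int): (Int, Int) =
  if (neededPrecision <= MAX_PRECISION) (neededPrecision, neededScale)
  else {
    val intDigits = neededPrecision - neededScale
    val minScale = math.min(neededScale, MIN_ADJUSTED_SCALE)
    (MAX_PRECISION, math.max(MAX_PRECISION - intDigits, minScale))
  }

adjust(48, 20)  // (38, 10): precision capped at 38, scale shrunk but kept >= 6

// DB2's rule quoted above: the truncated copy has scale MAX(0, S - (P - 15)),
// i.e. no minimum scale is guaranteed.
def db2TruncatedScale(p: Int, s: Int): Int = math.max(0, s - (p - 15))
db2TruncatedScale(31, 20)  // 4
db2TruncatedScale(15, 5)   // 5 (no truncation needed)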

Re: Does anyone know how to build spark with scala12.4?

2017-11-28 Thread Ofir Manor
Hi,
as far as I know, Spark does not support Scala 2.12.
There is ongoing work to refactor / fix the Spark source code to support
Scala 2.12 - look for multiple emails from Sean Owen on this list in recent
months about his progress.
Once Spark supports Scala 2.12, I think the next target would be JDK 9
support.

Ofir Manor

Co-Founder & CTO | Equalum

Mobile: +972-54-7801286 | Email: ofir.ma...@equalum.io

On Tue, Nov 28, 2017 at 9:20 AM, Zhang, Liyun <liyun.zh...@intel.com> wrote:

> Hi all:
>
>   Does anyone know how to build Spark with Scala 2.12.4? I want to test
> whether Spark can work on JDK 9 or not. Scala 2.12.4 supports JDK 9. Did
> anyone try to build Spark with Scala 2.12.4 or compile successfully with
> JDK 9? I would appreciate some feedback from you.
>
>
>
>
>
> Best Regards
>
> Kelly Zhang/Zhang,Liyun
>
>
>


Re: FYI - Kafka's built-in performance test tool

2017-06-01 Thread Ofir Manor
Hi,
sorry for that - I sent my original email to this list by mistake (Gmail
autocomplete fooled me), and the page I linked isn't open to the public.

Anyway, since you are interested, here is the sample commands and output
from a VirtualBox image on my laptop.

1. Create a topic

kafka-topics.sh --create   --zookeeper localhost:2181 --partitions 2
--replication-factor 1 --topic perf_test_2_partitions


2. Write two million messages, each 1KB, at a target rate of 50MB/s:

$ kafka-producer-perf-test.sh --topic perf_test_2_partitions
--num-records 2000000 --record-size 1024 --throughput 50000
--producer-props bootstrap.servers=localhost:9092
249934 records sent, 49986.8 records/sec (48.82 MB/sec), 85.7 ms avg
latency, 415.0 max latency.
250142 records sent, 50028.4 records/sec (48.86 MB/sec), 2.2 ms avg
latency, 33.0 max latency.
250029 records sent, 50005.8 records/sec (48.83 MB/sec), 4.6 ms avg
latency, 74.0 max latency.
249997 records sent, 4.4 records/sec (48.83 MB/sec), 16.2 ms avg
latency, 98.0 max latency.
249902 records sent, 49980.4 records/sec (48.81 MB/sec), 5.7 ms avg
latency, 95.0 max latency.
249568 records sent, 49893.6 records/sec (48.72 MB/sec), 1.8 ms avg
latency, 21.0 max latency.
248190 records sent, 49638.0 records/sec (48.47 MB/sec), 10.9 ms avg
latency, 158.0 max latency.
2000000 records sent, 49993.750781 records/sec (48.82 MB/sec), 17.06
ms avg latency, 415.00 ms max latency, 2 ms 50th, 81 ms 95th, 343 ms
99th, 407 ms 99.9th
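A quick arithmetic check of the output above, assuming the tool reports MB as 1024 * 1024 bytes (which matches the numbers):

// why a 50,000 records/sec throttle shows up as ~48.8 MB/sec in the lines above
val recordsPerSec = 50000
val recordSizeBytes = 1024
val mbPerSec = recordsPerSec.toDouble * recordSizeBytes / (1024 * 1024)  // 48.828125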


3. Read ten million messages (each 1KB), at a rate of 231MB/s (it was
physical I/O - I saw with the iostat command that the SSD did work at that
rate):

(I ran the previous command several times, so the topic did have 10M messages)

$ kafka-consumer-perf-test.sh --topic perf_test_2_partitions
--messages 10000000 --broker-list localhost:9092 --date-format
HH:mm:ss:SSS

start.time, end.time, data.consumed.in.MB, MB.sec,
data.consumed.in.nMsg, nMsg.sec
19:32:53:408, 19:33:35:523, 9765.6250, 231.8800, 10000000, 237445.0908


If you / someone has followup questions, might be best to do it off
Spark dev list...


Ofir Manor

Co-Founder & CTO | Equalum

Mobile: +972-54-7801286 | Email: ofir.ma...@equalum.io

On Thu, Jun 1, 2017 at 7:25 AM, 郭健 <guo.j...@immomo.com> wrote:

> It seems to be an internal page, so I cannot access it:
>
> Your email address doesn't have access to equalum.atlassian.net
>
>
>
> *From:* Ofir Manor <ofir.ma...@equalum.io>
> *Date:* Friday, May 26, 2017, 01:12
> *To:* dev <dev@spark.apache.org>
> *Subject:* FYI - Kafka's built-in performance test tool
>
>
>
> It comes with source code.
> Some basic results from the VM,
>
>- Write every second 50K-60K messages, each 1KB (total 50MB-60MB)
>- Read every second more than 200K messages, each 1KB.
>
> May help in assessing whether any Kafka-related slowness is Kafka
> limitation or our implementation.
>
>
>
> https://equalum.atlassian.net/wiki/display/EQ/Kafka+performance+test+tool
>
> Ofir Manor
>
> Co-Founder & CTO | Equalum
>
> Mobile: +972-54-7801286 | Email: ofir.ma...@equalum.io
>


Re: SQL TIMESTAMP semantics vs. SPARK-18350

2017-05-25 Thread Ofir Manor
Reynold,
my point is that Spark should aim to follow the SQL standard instead of
rolling its own type system.
If I understand correctly, the existing implementation is similar to the
TIMESTAMP WITH LOCAL TIME ZONE data type in Oracle.
In addition, there are the standard TIMESTAMP and TIMESTAMP WITH TIMEZONE
data types which are missing from Spark.
So, it is better (for me) if instead of extending the existing types, Spark
would just implement the additional well-defined types properly.
Just trying to copy-paste CREATE TABLE between SQL engines should not be an
exercise of flags and incompatibilities.

Regarding the current behaviour, if I remember correctly I had to force our
Spark OS user into UTC so Spark won't change my timestamps.
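For reference, a minimal sketch of pinning timestamp handling to UTC without forcing the whole OS user into that zone. It assumes Spark 2.2+, where the session-timezone setting added by the work discussed in this thread (SPARK-18350) is available; older deployments fell back to exporting TZ=UTC or passing -Duser.timezone=UTC at submit time via spark.driver.extraJavaOptions / spark.executor.extraJavaOptions:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("utc-timestamps")
  .config("spark.sql.session.timeZone", "UTC")  // render/parse SQL timestamps in UTC
  .getOrCreate()

spark.sql("SELECT current_timestamp() AS now").show(truncate = false)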

Ofir Manor

Co-Founder & CTO | Equalum

Mobile: +972-54-7801286 | Email: ofir.ma...@equalum.io

On Thu, May 25, 2017 at 1:33 PM, Reynold Xin <r...@databricks.com> wrote:

> Zoltan,
>
> Thanks for raising this again, although I'm a bit confused since I've
> communicated with you a few times on JIRA and on private emails to explain
> that you have some misunderstanding of the timestamp type in Spark and some
> of your statements are wrong (e.g. the except text file part). Not sure why
> you didn't get any of those.
>
>
> Here's another try:
>
>
> 1. I think you guys misunderstood the semantics of timestamp in Spark
> before session local timezone change. IIUC, Spark has always assumed
> timestamps to be with timezone, since it parses timestamps with timezone
> and does all the datetime conversions with timezone in mind (it doesn't
> ignore timezone if a timestamp string has timezone specified). The session
> local timezone change further pushes Spark to that direction, but the
> semantics has been with timezone before that change. Just run Spark on
> machines with different timezone and you will know what I'm talking about.
>
> 2. CSV/Text is not different. The data type has always been "with
> timezone". If you put a timezone in the timestamp string, it parses the
> timezone.
>
> 3. We can't change semantics now, because it'd break all existing Spark
> apps.
>
> 4. We can however introduce a new timestamp without timezone type, and
> have a config flag to specify which one (with tz or without tz) is the
> default behavior.
>
>
>
> On Wed, May 24, 2017 at 5:46 PM, Zoltan Ivanfi <z...@cloudera.com> wrote:
>
>> Hi,
>>
>> Sorry if you receive this mail twice, it seems that my first attempt did
>> not make it to the list for some reason.
>>
>> I would like to start a discussion about SPARK-18350
>> <https://issues.apache.org/jira/browse/SPARK-18350> before it gets
>> released because it seems to be going in a different direction than what
>> other SQL engines of the Hadoop stack do.
>>
>> ANSI SQL defines the TIMESTAMP type (also known as TIMESTAMP WITHOUT TIME
>> ZONE) to have timezone-agnostic semantics - basically a type that expresses
>> readings from calendars and clocks and is unaffected by time zone. In the
>> Hadoop stack, Impala has always worked like this and recently Presto also
>> took steps <https://github.com/prestodb/presto/issues/7122> to become
>> standards compliant. (Presto's design doc
>> <https://docs.google.com/document/d/1UUDktZDx8fGwHZV4VyaEDQURorFbbg6ioeZ5KMHwoCk/edit>
>> also contains a great summary of the different semantics.) Hive has a
>> timezone-agnostic TIMESTAMP type as well (except for Parquet, a major
>> source of incompatibility that is already being addressed
>> <https://issues.apache.org/jira/browse/HIVE-12767>). A TIMESTAMP in
>> SparkSQL, however, has UTC-normalized local time semantics (except for
>> textfile), which is generally the semantics of the TIMESTAMP WITH TIME ZONE
>> type.
>>
>> Given that timezone-agnostic TIMESTAMP semantics provide standards
>> compliance and consistency with most SQL engines, I was wondering whether
>> SparkSQL should also consider it in order to become ANSI SQL compliant and
>> interoperable with other SQL engines of the Hadoop stack. Should SparkSQL
>> adapt this semantics in the future, SPARK-18350
>> <https://issues.apache.org/jira/browse/SPARK-18350> may turn out to be a
>> source of problems. Please correct me if I'm wrong, but this change seems
>> to explicitly assign TIMESTAMP WITH TIME ZONE semantics to the TIMESTAMP
>> type. I think SPARK-18350 would be a great feature for a separate TIMESTAMP
>> WITH TIME ZONE type, but the plain unqualified TIMESTAMP type would be
>> better becoming timezone-agnostic instead of gaining further timezone-aware
>> capabilities. (Of course becoming timezone-agnostic would be a behavior
>

FYI - Kafka's built-in performance test tool

2017-05-25 Thread Ofir Manor
It comes with source code.
Some basic results from the VM,

   - Write every second 50K-60K messages, each 1KB (total 50MB-60MB)
   - Read every second more than 200K messages, each 1KB.

May help in assessing whether any Kafka-related slowness is Kafka
limitation or our implementation.

   https://equalum.atlassian.net/wiki/display/EQ/Kafka+performance+test+tool

Ofir Manor

Co-Founder & CTO | Equalum

Mobile: +972-54-7801286 | Email: ofir.ma...@equalum.io


Re: SQL TIMESTAMP semantics vs. SPARK-18350

2017-05-25 Thread Ofir Manor
Hi Zoltan,
thanks for bringing this up, this is really important to me!
Personally, as a user developing apps on top of Spark and other tools, the
current timestamp semantics has been a source of some pain - needing to
undo Spark's "auto-correcting" of timestamps.
It would be really great if we could have standard timestamp handling, like
every other SQL-compliant database and processing engine (choosing between
the two main SQL types). I was under the impression that better SQL
compliance was one of the top priorities of the Spark project.
I guess it is pretty late in the release cycle - but it seems SPARK-18350
was just introduced a couple of weeks ago. Maybe it should be reverted to
unblock the 2.2 release, and a more proper solution could be implemented
for the next release after a more comprehensive discussion?
Just my two cents,

Ofir Manor

Co-Founder & CTO | Equalum

Mobile: +972-54-7801286 | Email: ofir.ma...@equalum.io

On Wed, May 24, 2017 at 6:46 PM, Zoltan Ivanfi <z...@cloudera.com> wrote:

> Hi,
>
> Sorry if you receive this mail twice, it seems that my first attempt did
> not make it to the list for some reason.
>
> I would like to start a discussion about SPARK-18350
> <https://issues.apache.org/jira/browse/SPARK-18350> before it gets
> released because it seems to be going in a different direction than what
> other SQL engines of the Hadoop stack do.
>
> ANSI SQL defines the TIMESTAMP type (also known as TIMESTAMP WITHOUT TIME
> ZONE) to have timezone-agnostic semantics - basically a type that expresses
> readings from calendars and clocks and is unaffected by time zone. In the
> Hadoop stack, Impala has always worked like this and recently Presto also
> took steps <https://github.com/prestodb/presto/issues/7122> to become
> standards compliant. (Presto's design doc
> <https://docs.google.com/document/d/1UUDktZDx8fGwHZV4VyaEDQURorFbbg6ioeZ5KMHwoCk/edit>
> also contains a great summary of the different semantics.) Hive has a
> timezone-agnostic TIMESTAMP type as well (except for Parquet, a major
> source of incompatibility that is already being addressed
> <https://issues.apache.org/jira/browse/HIVE-12767>). A TIMESTAMP in
> SparkSQL, however, has UTC-normalized local time semantics (except for
> textfile), which is generally the semantics of the TIMESTAMP WITH TIME ZONE
> type.
>
> Given that timezone-agnostic TIMESTAMP semantics provide standards
> compliance and consistency with most SQL engines, I was wondering whether
> SparkSQL should also consider it in order to become ANSI SQL compliant and
> interoperable with other SQL engines of the Hadoop stack. Should SparkSQL
> adapt this semantics in the future, SPARK-18350
> <https://issues.apache.org/jira/browse/SPARK-18350> may turn out to be a
> source of problems. Please correct me if I'm wrong, but this change seems
> to explicitly assign TIMESTAMP WITH TIME ZONE semantics to the TIMESTAMP
> type. I think SPARK-18350 would be a great feature for a separate TIMESTAMP
> WITH TIME ZONE type, but the plain unqualified TIMESTAMP type would be
> better becoming timezone-agnostic instead of gaining further timezone-aware
> capabilities. (Of course becoming timezone-agnostic would be a behavior
> change, so it must be optional and configurable by the user, as in Presto.)
>
> I would like to hear your opinions about this concern and about TIMESTAMP
> semantics in general. Does the community agree that a standards-compliant
> and interoperable TIMESTAMP type is desired? Do you perceive SPARK-18350 as
> a potential problem in achieving this or do I misunderstand the effects of
> this change?
>
> Thanks,
>
> Zoltan
>
> ---
>
> List of links in case in-line links do not work:
>
>-
>
>SPARK-18350: https://issues.apache.org/jira/browse/SPARK-18350
>-
>
>Presto's change: https://github.com/prestodb/presto/issues/7122
>-
>
>    Presto's design doc: https://docs.google.com/document/d/1UUDktZDx8fGwHZV4VyaEDQURorFbbg6ioeZ5KMHwoCk/edit
>
>
>
>


Re: [ANNOUNCE] Apache Spark 2.1.1

2017-05-03 Thread Ofir Manor
Looking good...
one small thing - the documentation on the web site is still 2.1.0.
Specifically, the home page has a link (under the Documentation menu) labeled
Latest Release (Spark 2.1.1), but when I click it, I get the 2.1.0
documentation.

Ofir Manor

Co-Founder & CTO | Equalum

Mobile: +972-54-7801286 | Email: ofir.ma...@equalum.io

On Wed, May 3, 2017 at 1:18 AM, Michael Armbrust <mich...@databricks.com>
wrote:

> We are happy to announce the availability of Spark 2.1.1!
>
> Apache Spark 2.1.1 is a maintenance release, based on the branch-2.1
> maintenance branch of Spark. We strongly recommend all 2.1.x users to
> upgrade to this stable release.
>
> To download Apache Spark 2.1.1 visit http://spark.apache.org/downloads.html
>
> We would like to acknowledge all community members for contributing
> patches to this release.
>


Re: structured streaming and window functions

2016-11-17 Thread Ofir Manor
I agree with you. I think that once we have sessionization, we could
aim for richer processing capabilities per session. As I imagine it, a
session is an ordered sequence of data to which we could apply computation
(like CEP).


Ofir Manor

Co-Founder & CTO | Equalum

Mobile: +972-54-7801286 | Email: ofir.ma...@equalum.io

On Thu, Nov 17, 2016 at 5:16 PM, assaf.mendelson <assaf.mendel...@rsa.com>
wrote:

> It is true that this is sessionizing but I brought it as an example for
> finding an ordered pattern in the data.
>
> In general, using a simple window (e.g. 24 hours) in structured streaming is
> explained in the grouping-by-time documentation and is very clear.
>
> What I was trying to figure out is how to do streaming of cases where you
> actually have to have some sorting to find patterns, especially when some
> of the data may come in late.
>
> I was trying to figure out if there is plan to support this and if so,
> what would be the performance implications.
>
> Assaf.
>
>
>
> *From:* Ofir Manor [via Apache Spark Developers List]
> *Sent:* Thursday, November 17, 2016 5:13 PM
> *To:* Mendelson, Assaf
> *Subject:* Re: structured streaming and window functions
>
>
>
> Assaf, I think what you are describing is actually sessionizing, by user,
> where a session is ended by a successful login event.
>
> For each session, you want to count the number of failed login events.
>
> If so, this is tracked by https://issues.apache.org/
> jira/browse/SPARK-10816 (didn't start yet)
>
>
> Ofir Manor
>
> Co-Founder & CTO | Equalum
>
> Mobile: +972-54-7801286 | Email: [hidden email]
>
>
>
> On Thu, Nov 17, 2016 at 2:52 PM, assaf.mendelson <[hidden email]> wrote:
>
> Is there a plan to support sql window functions?
>
> I will give an example of use: Let’s say we have login logs. What we want
> to do is for each user we would want to add the number of failed logins for
> each successful login. How would you do it with structured streaming?
>
> As this is currently not supported, is there a plan on how to support it
> in the future?
>
> Assaf.
>
>
>
> *From:* Herman van Hövell tot Westerflier-2 [via Apache Spark Developers
> List]
> *Sent:* Thursday, November 17, 2016 1:27 PM
> *To:* Mendelson, Assaf
> *Subject:* Re: structured streaming and window functions
>
>
>
> What kind of window functions are we talking about? Structured streaming
> only supports time window aggregates, not the more general sql window
> function (sum(x) over (partition by ... order by ...)) aggregates.
>
>
>
> The basic idea is that you use incremental aggregation and store the
> aggregation buffer (not the end result) in a state store after each
> increment. When a new batch comes in, you perform aggregation on that
> batch, merge the result of that aggregation with the buffer in the state
> store, update the state store and return the new result.
>
>
>
> This is much harder than it sounds, because you need to maintain state in
> a fault tolerant way and you need to have some eviction policy (watermarks
> for instance) for aggregation buffers to prevent the state store from
> reaching an infinite size.
>
>
>
> On Thu, Nov 17, 2016 at 12:19 AM, assaf.mendelson <[hidden email]> wrote:
>
> Hi,
>
> I have been trying to figure out how structured streaming handles window
> functions efficiently.
>
> The portion I understand is that whenever new data arrived, it is grouped
> by the time and the aggregated data is added to the state.
>
> However, unlike operations like sum etc. window functions need the
> original data and can change when data arrives late.
>
> So if I understand correctly, this would mean that we would have to save
> the original data and rerun on it to calculate the window function every
> time new data arrives.
>
> Is this correct? Are there ways to go around this issue?
>
>
>
> Assaf.
>
>
> --
>
> View this message in context: structured streaming and window functions
> <http://apache-spark-developers-list.1001551.n3.nabble.com/structured-streaming-and-window-functions-tp19930.html>
> Sent from the Apache Spark Developers List mailing list arch
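Herman's explanation quoted above boils down to: keep a partial aggregation buffer per group in a state store and merge each micro-batch's partial result into it. A minimal, framework-free Scala sketch of that idea for an average (all names are hypothetical, not Spark internals):

case class AvgBuffer(sum: Double, count: Long) {
  def add(value: Double): AvgBuffer = AvgBuffer(sum + value, count + 1)
  def merge(other: AvgBuffer): AvgBuffer = AvgBuffer(sum + other.sum, count + other.count)
  def result: Double = if (count == 0) 0.0 else sum / count
}

// "state store": one buffer per key, updated as each new batch arrives
var store = Map.empty[String, AvgBuffer]

def onNewBatch(batch: Seq[(String, Double)]): Unit = {
  // aggregate the new batch on its own...
  val partial = batch.groupBy(_._1).map { case (key, rows) =>
    key -> rows.foldLeft(AvgBuffer(0, 0)) { case (buf, (_, v)) => buf.add(v) }
  }
  // ...then merge it into the stored buffers
  store = partial.foldLeft(store) { case (s, (key, buf)) =>
    s.updated(key, s.getOrElse(key, AvgBuffer(0, 0)).merge(buf))
  }
}
// without an eviction policy (e.g. watermarks), `store` grows without bound, which is Herman's point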

Re: structured streaming and window functions

2016-11-17 Thread Ofir Manor
Assaf, I think what you are describing is actually sessionizing, by user,
where a session is ended by a successful login event.
For each session, you want to count the number of failed login events.
If so, this is tracked by https://issues.apache.org/jira/browse/SPARK-10816
(didn't start yet)
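For contrast, what Structured Streaming does support at this point is aggregation over fixed event-time windows. A sketch of counting failed logins per user per 10-minute window (the schema, path and column names are made up for illustration); the "failed logins since the previous successful login" semantics Assaf asks about is the session window that SPARK-10816 would add:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.window
import org.apache.spark.sql.types._

val spark = SparkSession.builder().appName("failed-logins").getOrCreate()
import spark.implicits._

// hypothetical streaming source of login events
val schema = new StructType()
  .add("user", StringType)
  .add("status", StringType)
  .add("eventTime", TimestampType)

val logins = spark.readStream.schema(schema).json("/tmp/login-logs")

// failed logins per user per fixed 10-minute event-time window
val failedPerWindow = logins
  .filter($"status" === "FAILED")
  .groupBy(window($"eventTime", "10 minutes"), $"user")
  .count()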

Ofir Manor

Co-Founder & CTO | Equalum

Mobile: +972-54-7801286 | Email: ofir.ma...@equalum.io

On Thu, Nov 17, 2016 at 2:52 PM, assaf.mendelson <assaf.mendel...@rsa.com>
wrote:

> Is there a plan to support sql window functions?
>
> I will give an example of use: Let’s say we have login logs. What we want
> to do is for each user we would want to add the number of failed logins for
> each successful login. How would you do it with structured streaming?
>
> As this is currently not supported, is there a plan on how to support it
> in the future?
>
> Assaf.
>
>
>
> *From:* Herman van Hövell tot Westerflier-2 [via Apache Spark Developers
> List]
> *Sent:* Thursday, November 17, 2016 1:27 PM
> *To:* Mendelson, Assaf
> *Subject:* Re: structured streaming and window functions
>
>
>
> What kind of window functions are we talking about? Structured streaming
> only supports time window aggregates, not the more general sql window
> function (sum(x) over (partition by ... order by ...)) aggregates.
>
>
>
> The basic idea is that you use incremental aggregation and store the
> aggregation buffer (not the end result) in a state store after each
> increment. When a new batch comes in, you perform aggregation on that
> batch, merge the result of that aggregation with the buffer in the state
> store, update the state store and return the new result.
>
>
>
> This is much harder than it sounds, because you need to maintain state in
> a fault tolerant way and you need to have some eviction policy (watermarks
> for instance) for aggregation buffers to prevent the state store from
> reaching an infinite size.
>
>
>
> On Thu, Nov 17, 2016 at 12:19 AM, assaf.mendelson <[hidden email]> wrote:
>
> Hi,
>
> I have been trying to figure out how structured streaming handles window
> functions efficiently.
>
> The portion I understand is that whenever new data arrived, it is grouped
> by the time and the aggregated data is added to the state.
>
> However, unlike operations like sum etc. window functions need the
> original data and can change when data arrives late.
>
> So if I understand correctly, this would mean that we would have to save
> the original data and rerun on it to calculate the window function every
> time new data arrives.
>
> Is this correct? Are there ways to go around this issue?
>
>
>
> Assaf.
>
>
> --
>
> View this message in context: structured streaming and window functions
> <http://apache-spark-developers-list.1001551.n3.nabble.com/structured-streaming-and-window-functions-tp19930.html>
> Sent from the Apache Spark Developers List mailing list archive
> <http://apache-spark-developers-list.1001551.n3.nabble.com/> at
> Nabble.com.
>
>
>
>
> --
>
> *If you reply to this email, your message will be added to the discussion
> below:*
>
> http://apache-spark-developers-list.1001551.n3.nabble.com/structured-
> streaming-and-window-functions-tp19930p19933.html
>
> To start a new topic under Apache Spark Developers List, email [hidden email].
> To unsubscribe from Apache Spark Developers List, click here.
>
> --
> View this message in context: RE: structured streaming and window
> functions
> <http://apache-spark-developers-list.1001551.n3.nabble.com/structured-streaming-and-window-functions-tp19930p19934.html>
>
> Sent from the Apache Spark Developers List mailing list archive
> <http://apache-spark-developers-list.1001551.n3.nabble.com/> at
> Nabble.com.
>


Re: Straw poll: dropping support for things like Scala 2.10

2016-10-27 Thread Ofir Manor
I totally agree with Sean, just a small correction:
Java 7 and Python 2.6 have already been deprecated since Spark 2.0 (after a
lengthy discussion), so there is no need to discuss whether they should
become deprecated in 2.1:
  http://spark.apache.org/releases/spark-release-2-0-0.html#deprecations
The discussion is whether Scala 2.10 should also be marked as deprecated
(no one is objecting to that), and more importantly, when to actually move
from deprecation to dropping support for any combination of JDK /
Scala / Hadoop / Python.

Ofir Manor

Co-Founder & CTO | Equalum

Mobile: +972-54-7801286 | Email: ofir.ma...@equalum.io

On Fri, Oct 28, 2016 at 12:13 AM, Sean Owen <so...@cloudera.com> wrote:

> The burden may be a little more apparent when dealing with the day to day
> merging and fixing of breaks. The upside is maybe the more compelling
> argument though. For example, lambda-fying all the Java code, supporting
> java.time, and taking advantage of some newer Hadoop/YARN APIs is a
> moderate win for users too, and there's also a cost to not doing that.
>
> I must say I don't see a risk of fragmentation as nearly the problem it's
> made out to be here. We are, after all, here discussing _beginning_ to
> remove support _in 6 months_, for long since non-current versions of
> things. An org's decision to not, say, use Java 8 is a decision to not use
> the new version of lots of things. It's not clear this is a constituency
> that is either large or one to reasonably serve indefinitely.
>
> In the end, the Scala issue may be decisive. Supporting 2.10 - 2.12
> simultaneously is a bridge too far, and if 2.12 requires Java 8, it's a
> good reason for Spark to require Java 8. And Steve suggests that means a
> minimum of Hadoop 2.6 too. (I still profess ignorance of the Python part of
> the issue.)
>
> Put another way I am not sure what the criteria is, if not the above?
>
> I support deprecating all of these things, at the least, in 2.1.0.
> Although it's a separate question, I believe it's going to be necessary to
> remove support in ~6 months in 2.2.0.
>
>
> On Thu, Oct 27, 2016 at 4:36 PM Matei Zaharia <matei.zaha...@gmail.com>
> wrote:
>
>> Just to comment on this, I'm generally against removing these types of
>> things unless they create a substantial burden on project contributors. It
>> doesn't sound like Python 2.6 and Java 7 do that yet -- Scala 2.10 might,
>> but then of course we need to wait for 2.12 to be out and stable.
>>
>> In general, this type of stuff only hurts users, and doesn't have a huge
>> impact on Spark contributors' productivity (sure, it's a bit unpleasant,
>> but that's life). If we break compatibility this way too quickly, we
>> fragment the user community, and then either people have a crappy
>> experience with Spark because their corporate IT doesn't yet have an
>> environment that can run the latest version, or worse, they create more
>> maintenance burden for us because they ask for more patches to be
>> backported to old Spark versions (1.6.x, 2.0.x, etc). Python in particular
>> is pretty fundamental to many Linux distros.
>>
>> In the future, rather than just looking at when some software came out,
>> it may be good to have some criteria for when to drop support for
>> something. For example, if there are really nice libraries in Python 2.7 or
>> Java 8 that we're missing out on, that may be a good reason. The
>> maintenance burden for multiple Scala versions is definitely painful but I
>> also think we should always support the latest two Scala releases.
>>
>> Matei
>>
>> On Oct 27, 2016, at 12:15 PM, Reynold Xin <r...@databricks.com> wrote:
>>
>> I created a JIRA ticket to track this: https://issues.apache.
>> org/jira/browse/SPARK-18138
>>
>>
>>
>> On Thu, Oct 27, 2016 at 10:19 AM, Steve Loughran <ste...@hortonworks.com>
>> wrote:
>>
>>
>> On 27 Oct 2016, at 10:03, Sean Owen <so...@cloudera.com> wrote:
>>
>> Seems OK by me.
>> How about Hadoop < 2.6, Python 2.6? Those seem more removeable. I'd like
>> to add that to a list of things that will begin to be unsupported 6 months
>> from now.
>>
>>
>> If you go to java 8 only, then hadoop 2.6+ is mandatory.
>>
>>
>> On Wed, Oct 26, 2016 at 8:49 PM Koert Kuipers <ko...@tresata.com> wrote:
>>
>> that sounds good to me
>>
>> On Wed, Oct 26, 2016 at 2:26 PM, Reynold Xin <r...@databricks.com> wrote:
>>
>> We can do the following concrete proposal:
>>
>> 1. Plan to remove support for Java 7 / Scala 2.10 in Spark 2.2.0 (Mar/Apr
>> 2017).
>>
>> 2. In Spark 2.1.0 release, aggressively and explicitly announce the
>> deprecation of Java 7 / Scala 2.10 support.
>>
>> (a) It should appear in release notes, documentations that mention how to
>> build Spark
>>
>> (b) and a warning should be shown every time SparkContext is started
>> using Scala 2.10 or Java 7.
>>
>>
>>
>>
>>


Re: Watermarking in Structured Streaming to drop late data

2016-10-27 Thread Ofir Manor
Assaf,
I think you are using the term "window" differently than Structured
Streaming does... Also, you didn't consider groupBy. Here is an example:
I want to maintain, for every minute over the last six hours, a computation
(trend or average or stddev) on a five-minute window (from t-4 to t). So,
1. My window size is 5 minutes
2. The window slides every 1 minute (so, there is a new 5-minute window for
every minute)
3. Old windows should be purged if they are 6 hours old (based on event
time vs. clock?)
Point 3 is currently missing - the streaming job keeps all windows
forever, since the app may want to access very old windows, unless it
explicitly says otherwise.
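A sketch of points 1 and 2 with the existing sliding-window API (column names and the source are illustrative); point 3, evicting windows older than 6 hours of event time, has no API today, which is exactly the gap this proposal addresses:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{avg, window}
import org.apache.spark.sql.types._

val spark = SparkSession.builder().appName("sliding-windows").getOrCreate()
import spark.implicits._

val schema = new StructType()
  .add("eventTime", TimestampType)
  .add("value", DoubleType)

// hypothetical streaming source with an event-time column
val events = spark.readStream.schema(schema).json("/tmp/events")

// 5-minute windows sliding every minute (a new window per minute)
val perWindow = events
  .groupBy(window($"eventTime", "5 minutes", "1 minute"))
  .agg(avg($"value").as("avg_value"))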


Ofir Manor

Co-Founder & CTO | Equalum

Mobile: +972-54-7801286 | Email: ofir.ma...@equalum.io

On Thu, Oct 27, 2016 at 9:46 AM, assaf.mendelson <assaf.mendel...@rsa.com>
wrote:

> Hi,
>
> Should comments come here or in the JIRA?
>
> Any, I am a little confused on the need to expose this as an API to begin
> with.
>
> Let’s consider for a second the most basic behavior: We have some input
> stream and we want to aggregate a sum over a time window.
>
> This means that the window we should be looking at would be the maximum
> time across our data and back by the window interval. Everything older can
> be dropped.
>
> When new data arrives, the maximum time cannot move back, so we generally
> drop everything too old.
>
> This basically means we save only the latest time window.
>
> This simpler model would only break if we have a secondary aggregation
> which needs the results of multiple windows.
>
> Is this the use case we are trying to solve?
>
> If so, wouldn’t just calculating the bigger time window across the entire
> aggregation solve this?
>
> Am I missing something here?
>
>
>
> *From:* Michael Armbrust [via Apache Spark Developers List]
> *Sent:* Thursday, October 27, 2016 3:04 AM
> *To:* Mendelson, Assaf
> *Subject:* Re: Watermarking in Structured Streaming to drop late data
>
>
>
> And the JIRA: https://issues.apache.org/jira/browse/SPARK-18124
>
>
>
> On Wed, Oct 26, 2016 at 4:56 PM, Tathagata Das <[hidden email]> wrote:
>
> Hey all,
>
>
>
> We are planning implement watermarking in Structured Streaming that would
> allow us handle late, out-of-order data better. Specially, when we are
> aggregating over windows on event-time, we currently can end up keeping
> unbounded amount data as state. We want to define watermarks on the event
> time in order mark and drop data that are "too late" and accordingly age
> out old aggregates that will not be updated any more.
>
>
>
> To enable the user to specify details like lateness threshold, we are
> considering adding a new method to Dataset. We would like to get more
> feedback on this API. Here is the design doc
>
>
>
> https://docs.google.com/document/d/1z-Pazs5v4rA31azvmYhu4I5xwqaNQl6ZLIS03xhkfCQ/
>
>
>
> Please comment on the design and proposed APIs.
>
>
>
> Thank you very much!
>
>
>
> TD
>
>
>
>
> --
>
> *If you reply to this email, your message will be added to the discussion
> below:*
>
> http://apache-spark-developers-list.1001551.n3.nabble.com/Watermarking-in-
> Structured-Streaming-to-drop-late-data-tp19589p19590.html
>
> To start a new topic under Apache Spark Developers List, email [hidden email].
> To unsubscribe from Apache Spark Developers List, click here.
>
> --
> View this message in context: RE: Watermarking in Structured Streaming to
> drop late data
> <http://apache-spark-developers-list.1001551.n3.nabble.com/Watermarking-in-Structured-Streaming-to-drop-late-data-tp19589p19591.html>
> Sent from the Apache Spark Developers List mailing list archive
> <http://apache-spark-developers-list.1001551.n3.nabble.com/> at
> Nabble.com.
>


Re: Straw poll: dropping support for things like Scala 2.10

2016-10-25 Thread Ofir Manor
I think that 2.1 should include a visible deprecation message about Java 7,
Scala 2.10 and older Hadoop versions (plus python if there is a consensus
on that), to give users / admins early warning, followed by dropping them
from trunk for 2.2 once 2.1 is released.
Personally, we use only Scala 2.11 on JDK8.
Cody - Scala 2.12 will likely be released before Spark 2.1, maybe even
later this week: http://scala-lang.org/news/2.12.0-RC2

Ofir Manor

Co-Founder & CTO | Equalum

Mobile: +972-54-7801286 | Email: ofir.ma...@equalum.io

On Tue, Oct 25, 2016 at 7:28 PM, Cody Koeninger <c...@koeninger.org> wrote:

> I think only supporting 1 version of scala at any given time is not
> sufficient, 2 probably is ok.
>
> I.e. don't drop 2.10 before 2.12 is out + supported
>
> On Tue, Oct 25, 2016 at 10:56 AM, Sean Owen <so...@cloudera.com> wrote:
> > The general forces are that new versions of things to support emerge, and
> > are valuable to support, but have some cost to support in addition to old
> > versions. And the old versions become less used and therefore less
> valuable
> > to support, and at some point it tips to being more cost than value. It's
> > hard to judge these costs and benefits.
> >
> > Scala is perhaps the trickiest one because of the general mutual
> > incompatibilities across minor versions. The cost of supporting multiple
> > versions is high, and a third version is about to arrive. That's probably
> > the most pressing question. It's actually biting with some regularity
> now,
> > with compile errors on 2.10.
> >
> > (Python I confess I don't have an informed opinion about.)
> >
> > Java, Hadoop are not as urgent because they're more backwards-compatible.
> > Anecdotally, I'd be surprised if anyone today would "upgrade" to Java 7
> or
> > an old Hadoop version. And I think that's really the question. Even if
> one
> > decided to drop support for all this in 2.1.0, it would not mean people
> > can't use Spark with these things. It merely means they can't necessarily
> > use Spark 2.1.x. This is why we have maintenance branches for 1.6.x,
> 2.0.x.
> >
> > Tying Scala 2.11/12 support to Java 8 might make sense.
> >
> > In fact, I think that's part of the reason that an update in master,
> perhaps
> > 2.1.x, could be overdue, because it actually is just the beginning of the
> > end of the support burden. If you want to stop dealing with these in ~6
> > months they need to stop being supported in minor branches by right about
> > now.
> >
> >
> >
> >
> > On Tue, Oct 25, 2016 at 4:47 PM Mark Hamstra <m...@clearstorydata.com>
> > wrote:
> >>
> >> What's changed since the last time we discussed these issues, about 7
> >> months ago?  Or, another way to formulate the question: What are the
> >> threshold criteria that we should use to decide when to end Scala 2.10
> >> and/or Java 7 support?
> >>
> >> On Tue, Oct 25, 2016 at 8:36 AM, Sean Owen <so...@cloudera.com> wrote:
> >>>
> >>> I'd like to gauge where people stand on the issue of dropping support
> for
> >>> a few things that were considered for 2.0.
> >>>
> >>> First: Scala 2.10. We've seen a number of build breakages this week
> >>> because the PR builder only tests 2.11. No big deal at this stage,
> but, it
> >>> did cause me to wonder whether it's time to plan to drop 2.10 support,
> >>> especially with 2.12 coming soon.
> >>>
> >>> Next, Java 7. It's reasonably old and out of public updates at this
> >>> stage. It's not that painful to keep supporting, to be honest. It would
> >>> simplify some bits of code, some scripts, some testing.
> >>>
> >>> Hadoop versions: I think the the general argument is that most anyone
> >>> would be using, at the least, 2.6, and it would simplify some code
> that has
> >>> to reflect to use not-even-that-new APIs. It would remove some moderate
> >>> complexity in the build.
> >>>
> >>>
> >>> "When" is a tricky question. Although it's a little aggressive for
> minor
> >>> releases, I think these will all happen before 3.x regardless. 2.1.0
> is not
> >>> out of the question, though coming soon. What about ... 2.2.0?
> >>>
> >>>
> >>> Although I tend to favor dropping support, I'm mostly asking for
> current
> >>> opinions.
> >>
> >>
> >
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: StructuredStreaming status

2016-10-19 Thread Ofir Manor
Thanks a lot Michael! I really appreciate your sharing.
Logistically, I suggest finding a way to tag all structured streaming
JIRAs, so it wouldn't be so hard for anyone wanting to participate to find
them, and also having something like the ML roadmap JIRA.
Regarding your list, evicting state seems very important. If I understand
correctly, currently state grows forever (when using windows), so it is
impractical to run a long-running streaming job with decent state. It would
be great if the user could bound the state by event time (it is also very
natural).
I personally see sessionization as lower priority (seems like a niche
requirement). To me, supporting only a single stream of events that can
only be joined to static datasets makes building anything but the simplest
of short-running streaming jobs problematic (all interesting datasets
change over time). Also, the promise of interactive queries on top of a
computed, live dataset likely has a wider appeal (as it was presented since
early this year as one of the goals of structured streaming). Also making
the sources and sinks API nicer to third-party developers to encourage
adoption and plugins, or beefing up the list of builtin exactly-once
sources and sinks (maybe also have a pluggable state store, as I've seen
some wanting, which may better enable interactive queries).
In addition, I think you should really identify what needs to be done to
make this API stable and focus on that. I think that for adoption, you'll
need to be clear on the full list of gaps / gotchas, and clearly
communicate the project priorities / target timeline (again, just like ML
does it), hopefully after some community discussion...

On a personal note, I'm quite surprised that this is all the progress in
Structured Streaming over the last three months since 2.0 was released. I
was under the impression that this was one of the biggest things that the
Spark community actively works on, but that is clearly not the case, given
that most of the activity is a couple of (very important) JIRAs from the
last several weeks. Not really sure how to parse that yet...
I think having a clearer, prioritized roadmap going forward will be a
good first step to recalibrate expectations for 2.2 and for graduating from an
alpha state. But especially, I think you guys seriously need to figure out
what the bottleneck is here (lack of a dedicated owner? lack of committers
focusing on it?) and just fix it (recruit new committers to work on it?) to
have a competitive streaming offering in a few quarters.

Just my two cents,

Ofir Manor

Co-Founder & CTO | Equalum

Mobile: +972-54-7801286 | Email: ofir.ma...@equalum.io

On Wed, Oct 19, 2016 at 10:45 PM, Michael Armbrust <mich...@databricks.com>
wrote:

> Anything that is actively being designed should be in JIRA, and it seems
> like you found most of it.  In general, release windows can be found on the
> wiki <https://cwiki.apache.org/confluence/display/SPARK/Wiki+Homepage>.
>
> 2.1 has a lot of stability fixes as well as the kafka support you
> mentioned.  It may also include some of the following.
>
> The items I'd like to start thinking about next are:
>  - Evicting state from the store based on event time watermarks
>  - Sessionization (grouping together related events by key / eventTime)
>  - Improvements to the query planner (remove some of the restrictions on
> what queries can be run).
>
> This is roughly in order based on what I've been hearing users hit the
> most.  Would love more feedback on what is blocking real use cases.
>
> On Tue, Oct 18, 2016 at 1:51 AM, Ofir Manor <ofir.ma...@equalum.io> wrote:
>
>> Hi,
>> I hope it is the right forum.
>> I am looking for some information of what to expect from
>> StructuredStreaming in its next releases to help me choose when / where to
>> start using it more seriously (or where to invest in workarounds and where
>> to wait). I couldn't find a good place where such planning discussed for
>> 2.1  (like, for example ML and SPARK-15581).
>> I'm aware of the 2.0 documented limits (http://spark.apache.org/docs/
>> 2.0.1/structured-streaming-programming-guide.html#unsupported-operations),
>> like no support for multiple aggregations levels, joins are strictly to a
>> static dataset (no SCD or stream-stream) etc, limited sources / sinks (like
>> no sink for interactive queries) etc etc
>> I'm also aware of some changes that have landed in master, like the new
>> Kafka 0.10 source (and its on-going improvements) in SPARK-15406, the
>> metrics in SPARK-17731, and some improvements for the file source.
>> If I remember correctly, the discussion on Spark release cadence
>> concluded with a preference to a four-month cycles, with likely code freeze
>> pretty soon (end of October). So I believe the scope for 2.1 should likely
>> quit

StructuredStreaming status

2016-10-18 Thread Ofir Manor
Hi,
I hope it is the right forum.
I am looking for some information on what to expect from
StructuredStreaming in its next releases, to help me choose when / where to
start using it more seriously (or where to invest in workarounds and where
to wait). I couldn't find a good place where such planning is discussed for
2.1 (like, for example, ML and SPARK-15581).
I'm aware of the 2.0 documented limits (
http://spark.apache.org/docs/2.0.1/structured-streaming-programming-guide.html#unsupported-operations),
like no support for multiple aggregations levels, joins are strictly to a
static dataset (no SCD or stream-stream) etc, limited sources / sinks (like
no sink for interactive queries) etc etc
I'm also aware of some changes that have landed in master, like the new
Kafka 0.10 source (and its on-going improvements) in SPARK-15406, the
metrics in SPARK-17731, and some improvements for the file source.
If I remember correctly, the discussion on Spark release cadence concluded
with a preference for four-month cycles, with a likely code freeze pretty
soon (end of October). So I believe the scope for 2.1 should already be quite
clear to some, and that 2.2 planning should likely be starting about now.
Any visibility / sharing will be highly appreciated!
thanks in advance,

Ofir Manor

Co-Founder & CTO | Equalum

Mobile: +972-54-7801286 | Email: ofir.ma...@equalum.io


Re: Official Stance on Not Using Spark Submit

2016-10-10 Thread Ofir Manor
Funny, someone from my team talked to me about that idea yesterday.
We use SparkLauncher, but it just calls spark-submit, which calls other
scripts that start a new Java program that tries to submit (in our case in
cluster mode - the driver is started in the Spark cluster) and exits.
That makes it a challenge to troubleshoot cases where submit fails,
especially when users try our app on their own Spark environment. He
hoped to get a more decent / specific exception if submit failed, or to be
able to debug it in an IDE (the actual call to the master, its response,
etc.).
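For reference, a minimal SparkLauncher sketch of the setup described above (the paths, class name and master URL are placeholders). startApplication() returns a handle that can be polled or listened to instead of shelling out to spark-submit manually, although under the hood it still goes through the same scripts:

import org.apache.spark.launcher.{SparkAppHandle, SparkLauncher}

val handle: SparkAppHandle = new SparkLauncher()
  .setAppResource("/path/to/our-app.jar")       // placeholder
  .setMainClass("com.example.Main")             // placeholder
  .setMaster("spark://spark-master:7077")       // placeholder
  .setDeployMode("cluster")
  .setConf(SparkLauncher.DRIVER_MEMORY, "2g")
  .startApplication()

handle.addListener(new SparkAppHandle.Listener {
  override def stateChanged(h: SparkAppHandle): Unit = println(s"state: ${h.getState}")
  override def infoChanged(h: SparkAppHandle): Unit = ()
})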

Ofir Manor

Co-Founder & CTO | Equalum

Mobile: +972-54-7801286 | Email: ofir.ma...@equalum.io

On Mon, Oct 10, 2016 at 9:13 PM, Russell Spitzer <russell.spit...@gmail.com>
wrote:

> Just folks who don't want to use spark-submit, no real use-cases I've seen
> yet.
>
> I didn't know about SparkLauncher myself and I don't think there are any
> official docs on that or launching spark as an embedded library for tests.
>
> On Mon, Oct 10, 2016 at 11:09 AM Matei Zaharia <matei.zaha...@gmail.com>
> wrote:
>
>> What are the main use cases you've seen for this? Maybe we can add a page
>> to the docs about how to launch Spark as an embedded library.
>>
>> Matei
>>
>> On Oct 10, 2016, at 10:21 AM, Russell Spitzer <russell.spit...@gmail.com>
>> wrote:
>>
>> I actually had not seen SparkLauncher before, that looks pretty great :)
>>
>> On Mon, Oct 10, 2016 at 10:17 AM Russell Spitzer <
>> russell.spit...@gmail.com> wrote:
>>
>>> I'm definitely only talking about non-embedded uses here as I also use
>>> embedded Spark (cassandra, and kafka) to run tests. This is almost always
>>> safe since everything is in the same JVM. It's only once we get to
>>> launching against a real distributed env do we end up with issues.
>>>
>>> Since PySpark uses spark-submit in the Java gateway, I'm not sure if that
>>> matters :)
>>>
>>> The cases I see are usually usually going through main directly, adding
>>> jars programatically.
>>>
>>> Usually ends up with classpath errors (Spark not on the CP, their jar
>>> not on the CP, dependencies not on the cp),
>>> conf errors (executors have the incorrect environment, executor
>>> classpath broken, not understanding spark-defaults won't do anything),
>>> Jar version mismatches
>>> Etc ...
>>>
>>> On Mon, Oct 10, 2016 at 10:05 AM Sean Owen <so...@cloudera.com> wrote:
>>>
>>>> I have also 'embedded' a Spark driver without much trouble. It isn't
>>>> that it can't work.
>>>>
>>>> The Launcher API is probably the recommended way to do that though.
>>>> spark-submit is the way to go for non programmatic access.
>>>>
>>>> If you're not doing one of those things and it is not working, yeah I
>>>> think people would tell you you're on your own. I think that's consistent
>>>> with all the JIRA discussions I have seen over time.
>>>>
>>>>
>>>> On Mon, Oct 10, 2016, 17:33 Russell Spitzer <russell.spit...@gmail.com>
>>>> wrote:
>>>>
>>>>> I've seen a variety of users attempting to work around using Spark
>>>>> Submit with at best middling levels of success. I think it would be 
>>>>> helpful
>>>>> if the project had a clear statement that submitting an application 
>>>>> without
>>>>> using Spark Submit is truly for experts only or is unsupported entirely.
>>>>>
>>>>> I know this is a pretty strong stance and other people have had
>>>>> different experiences than me so please let me know what you think :)
>>>>>
>>>>
>>


Re: Spark Improvement Proposals

2016-10-09 Thread Ofir Manor
This is a great discussion!
Maybe you could have a look at Kafka's process - it also uses Rejected
Alternatives and I personally find it very clear actually (the link also
leads to all KIPs):

https://cwiki.apache.org/confluence/display/KAFKA/Kafka+Improvement+Proposals
Cody - maybe you could take one of the open issues and write a sample
proposal? A concrete example might make it clearer for those who see this
for the first time. Maybe the Kafka offset discussion or some other
Kafka/Structured Streaming open issue? Would that be helpful?

Ofir Manor

Co-Founder & CTO | Equalum

Mobile: +972-54-7801286 | Email: ofir.ma...@equalum.io

On Mon, Oct 10, 2016 at 12:36 AM, Matei Zaharia <matei.zaha...@gmail.com>
wrote:

> Yup, this is the stuff that I found unclear. Thanks for clarifying here,
> but we should also clarify it in the writeup. In particular:
>
> - Goals needs to be about user-facing behavior ("people" is broad)
>
> - I'd rename Rejected Goals to Non-Goals. Otherwise someone will dig up
> one of these and say "Spark's developers have officially rejected X, which
> our awesome system has".
>
> - For user-facing stuff, I think you need a section on API. Virtually all
> other *IPs I've seen have that.
>
> - I'm still not sure why the strategy section is needed if the purpose is
> to define user-facing behavior -- unless this is the strategy for setting
> the goals or for defining the API. That sounds squarely like a design doc
> issue. In some sense, who cares whether the proposal is technically
> feasible right now? If it's infeasible, that will be discovered later
> during design and implementation. Same thing with rejected strategies --
> listing some of those is definitely useful sometimes, but if you make this
> a *required* section, people are just going to fill it in with bogus stuff
> (I've seen this happen before).
>
> Matei
>
> > On Oct 9, 2016, at 2:14 PM, Cody Koeninger <c...@koeninger.org> wrote:
> >
> > So to focus the discussion on the specific strategy I'm suggesting,
> > documented at
> >
> > https://github.com/koeninger/spark-1/blob/SIP-0/docs/spark-
> improvement-proposals.md
> >
> > "Goals: What must this allow people to do, that they can't currently?"
> >
> > Is it unclear that this is focusing specifically on people-visible
> behavior?
> >
> > Rejected goals -  are important because otherwise people keep trying
> > to argue about scope.  Of course you can change things later with a
> > different SIP and different vote, the point is to focus.
> >
> > Use cases - are something that people are going to bring up in
> > discussion.  If they aren't clearly documented as a goal ("This must
> > allow me to connect using SSL"), they should be added.
> >
> > Internal architecture - if the people who need specific behavior are
> > implementers of other parts of the system, that's fine.
> >
> > Rejected strategies - If you have none of these, you have no evidence
> > that the proponent didn't just go with the first thing they had in
> > mind (or have already implemented), which is a big problem currently.
> > Approval isn't binding as to specifics of implementation, so these
> > aren't handcuffs.  The goals are the contract, the strategy is
> > evidence that contract can actually be met.
> >
> > Design docs - I'm not touching design docs.  The markdown file I
> > linked specifically says of the strategy section "This is not a full
> > design document."  Is this unclear?  Design docs can be worked on
> > obviously, but that's not what I'm concerned with here.
> >
> >
> >
> >
> > On Sun, Oct 9, 2016 at 2:34 PM, Matei Zaharia <matei.zaha...@gmail.com>
> wrote:
> >> Hi Cody,
> >>
> >> I think this would be a lot more concrete if we had a more detailed
> template
> >> for SIPs. Right now, it's not super clear what's in scope -- e.g. are
> they
> >> a way to solicit feedback on the user-facing behavior or on the
> internals?
> >> "Goals" can cover both things. I've been thinking of SIPs more as
> Product
> >> Requirements Docs (PRDs), which focus on *what* a code change should do
> as
> >> opposed to how.
> >>
> >> In particular, here are some things that you may or may not consider in
> >> scope for SIPs:
> >>
> >> - Goals and non-goals: This is definitely in scope, and IMO should
> focus on
> >> user-visible behavior (e.g. "system supports SQL window functions" or
> >> "system continues working if one node fails"). BTW I wouldn't sa

Re: Structured Streaming with Kafka sources/sinks

2016-08-30 Thread Ofir Manor
I personally find it disappointing that a big chunk of Spark's design and
development is happening behind closed curtains. It makes it harder than
necessary for me to work with Spark. In recent weeks we had to improvise a
temporary solution for reading from Kafka (from Structured Streaming) to
unblock our development, and I feel that if the design and development of that
feature had been done in the open, it would have saved us a lot of hassle (and
would have reduced the refactoring of our code base).
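
As a concrete illustration, here is a rough sketch of one possible interim
bridge (illustrative only - not necessarily what we actually built): consume
Kafka with the existing DStream API, land each micro-batch as files, and let a
Structured Streaming file source pick them up:

// Illustrative sketch only - assumes a spark-shell session (sc available) and
// the spark-streaming-kafka-0-8 artifact on the classpath; the broker
// addresses, topic name and staging path are made up.
import kafka.serializer.StringDecoder
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val ssc = new StreamingContext(sc, Seconds(10))
val kafkaParams = Map("metadata.broker.list" -> "broker1:9092,broker2:9092")

// direct stream of (key, value) pairs from the "events" topic
val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, Set("events"))

// land each non-empty micro-batch as text files under a staging directory;
// a Structured Streaming query can then read them back with the file source,
// e.g. spark.readStream.text("/staging/events/*")
stream.map(_._2).foreachRDD { (rdd, time) =>
  if (!rdd.isEmpty()) rdd.saveAsTextFile(s"/staging/events/batch-${time.milliseconds}")
}

ssc.start()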

It's hard not to compare it to other Apache projects - for example, I believe
most of the Apache Kafka full-time contributors work at a single company,
but they manage as a community to have a very transparent design and
development process, which seems to work great.

Ofir Manor

Co-Founder & CTO | Equalum

Mobile: +972-54-7801286 | Email: ofir.ma...@equalum.io

On Mon, Aug 29, 2016 at 10:39 PM, Fred Reiss <freiss@gmail.com> wrote:

> I think that the community really needs some feedback on the progress of
> this very important task. Many existing Spark Streaming applications can't
> be ported to Structured Streaming without Kafka support.
>
> Is there a design document somewhere?  Or can someone from the DataBricks
> team break down the existing monolithic JIRA issue into smaller steps that
> reflect the current development plan?
>
> Fred
>
>
> On Sat, Aug 27, 2016 at 2:32 PM, Koert Kuipers <ko...@tresata.com> wrote:
>
>> thats great
>>
>> is this effort happening anywhere that is publicly visible? github?
>>
>> On Tue, Aug 16, 2016 at 2:04 AM, Reynold Xin <r...@databricks.com> wrote:
>>
>>> We (the team at Databricks) are working on one currently.
>>>
>>>
>>> On Mon, Aug 15, 2016 at 7:26 PM, Cody Koeninger <c...@koeninger.org>
>>> wrote:
>>>
>>>> https://issues.apache.org/jira/browse/SPARK-15406
>>>>
>>>> I'm not working on it (yet?), never got an answer to the question of
>>>> who was planning to work on it.
>>>>
>>>> On Mon, Aug 15, 2016 at 9:12 PM, Guo, Chenzhao <chenzhao@intel.com>
>>>> wrote:
>>>> > Hi all,
>>>> >
>>>> >
>>>> >
>>>> > I’m trying to write Structured Streaming test code and will deal with a
>>>> > Kafka source. Currently Spark 2.0 doesn’t support Kafka sources/sinks.
>>>> >
>>>> >
>>>> >
>>>> > I found some Databricks slides saying that Kafka sources/sinks will be
>>>> > implemented in Spark 2.0, so is there anybody working on this? And when
>>>> > will it be released?
>>>> >
>>>> >
>>>> >
>>>> > Thanks,
>>>> >
>>>> > Chenzhao Guo
>>>>
>>>> -
>>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>>
>>>>
>>>
>>
>


Re: [ANNOUNCE] Announcing Apache Spark 2.0.0

2016-07-27 Thread Ofir Manor
Hold the release! There is a minor documentation issue :)
But seriously, congrats all on this massive achievement!

Anyway, I think it would be very helpful to add a link to the Structured
Streaming Developer Guide (Alpha) both on the documentation home page and at
the beginning of the "old" Spark Streaming Programming Guide, as I think many
users will look for it. I had a "deep link" to that page, so I hadn't noticed
until now that it is very hard to find. I'm referring to
this page:

http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html



Ofir Manor

Co-Founder & CTO | Equalum

Mobile: +972-54-7801286 | Email: ofir.ma...@equalum.io

On Wed, Jul 27, 2016 at 9:00 AM, Reynold Xin <r...@databricks.com> wrote:

> Hi all,
>
> Apache Spark 2.0.0 is the first release of Spark 2.x line. It includes
> 2500+ patches from 300+ contributors.
>
> To download Spark 2.0, head over to the download page:
> http://spark.apache.org/downloads.html
>
> To view the release notes:
> http://spark.apache.org/releases/spark-release-2-0-0.html
>
>
> (note: it can take a few hours for everything to be propagated, so you
> might get 404 on some download links.  If you see any issues with the
> release notes or webpage *please contact me directly, off-list*)
>
>


Re: drop java 7 support for spark 2.1.x or spark 2.2.x

2016-07-24 Thread Ofir Manor
BTW - "signalling ahead of time" is called deprecating, not dropping
support...
(personally I only use JDK 8 / Scala 2.11 so I'm for it)


Ofir Manor

Co-Founder & CTO | Equalum

Mobile: +972-54-7801286 | Email: ofir.ma...@equalum.io

On Sun, Jul 24, 2016 at 1:50 AM, Koert Kuipers <ko...@tresata.com> wrote:

> i care about signalling it in advance mostly. and given the performance
> differences we do have some interest in pushing towards java 8
>
> On Jul 23, 2016 6:10 PM, "Mark Hamstra" <m...@clearstorydata.com> wrote:
>
> Why the push to remove Java 7 support as soon as possible (which is how I
> read your "cluster admins plan to migrate by date X, so Spark should end
> Java 7 support then, too")?  First, I don't think we should be removing
> Java 7 support until some time after all or nearly all relevant clusters
> are actually no longer running on Java 6, and that targeting removal of
> support at our best guess about when admins are just *planning* to migrate
> isn't a very good idea.  Second, I don't see the significant difficulty or
> harm in continuing to support Java 7 for a while longer.
>
> On Sat, Jul 23, 2016 at 2:54 PM, Koert Kuipers <ko...@tresata.com> wrote:
>
>> dropping java 7 support was considered for spark 2.0.x but we decided
>> against it.
>>
>> ideally dropping support for a java version should be communicated far in
>> advance to facilitate the transition.
>>
>> is this the right time to make that decision and start communicating it
>> (mailing list, jira, etc.)? perhaps for spark 2.1.x or spark 2.2.x?
>>
>> my general sense is that most cluster admins have plans to migrate to
>> java 8 before end of year. so that could line up nicely with spark 2.2
>>
>>
>
>


Structured Streaming with Kafka source/sink

2016-05-11 Thread Ofir Manor
Hi,
I'm trying out Structured Streaming from the current 2.0 branch.
Does the branch currently support Kafka as either source or sink? I
couldn't find a specific JIRA or design doc for that in SPARK-8360 or in
the examples... Is it still targeted for 2.0?

Also, I naively assume it will look similar to hdfs or JDBC *stream("path")*,
where the path will be some sort of Kafka URI (maybe protocol + broker list
+ topic list). Is that the current thinking?
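
For illustration, this is the kind of call shape I am imagining (purely a guess
on my side - the "kafka" format name, the URI scheme and the option names below
are made up, not taken from any actual or planned API):

// Purely hypothetical sketch - none of this is a real API
val events = sqlContext.read
  .format("kafka")
  .stream("kafka://broker1:9092,broker2:9092/topic1,topic2")

// ...or, alternatively, with the broker list and topics passed as options
// instead of being packed into a URI:
val events2 = sqlContext.read
  .format("kafka")
  .option("brokers", "broker1:9092,broker2:9092")
  .option("topics", "topic1,topic2")
  .stream()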

Thanks,

Ofir Manor

Co-Founder & CTO | Equalum

Mobile: +972-54-7801286 | Email: ofir.ma...@equalum.io


Re: [discuss] ending support for Java 7 in Spark 2.0

2016-04-04 Thread Ofir Manor
I think that a backup plan could be to announce that JDK7 is deprecated in
Spark 2.0 and support for it will be fully removed in Spark 2.1. This gives
admins enough warning to install JDK8 alongside their "main" JDK (or fully
migrate to it), while allowing the project to merge JDK8-specific changes
to trunk right after the 2.0 release.

However, I personally think it is better to drop JDK7 now. I'm sure that
both the community and the distributors (Databricks, Cloudera, Hortonworks,
MapR, IBM etc) will all rush to help their customers migrate their
environment to support Spark 2.0, so I think any backlash won't be dramatic
or lasting.

Just my two cents,

Ofir Manor

Co-Founder & CTO | Equalum

Mobile: +972-54-7801286 | Email: ofir.ma...@equalum.io

On Mon, Apr 4, 2016 at 6:48 PM, Luciano Resende <luckbr1...@gmail.com>
wrote:

> Reynold,
>
> Considering the performance improvements you mentioned in your original
> e-mail and also considering that few other big data projects have already
> or are in progress of abandoning JDK 7, I think it would benefit Spark if
> we go with JDK 8.0 only.
>
> Are there users that will be less aggressive ? Yes, but those would most
> likely be in more stable releases like 1.6.x.
>
> On Sun, Apr 3, 2016 at 10:28 PM, Reynold Xin <r...@databricks.com> wrote:
>
>> Since my original email, I've talked to a lot more users and looked at
>> what various environments support. It is true that a lot of enterprises,
>> and even some technology companies, are still using Java 7. One thing is
>> that up until this date, users still can't install openjdk 8 on Ubuntu by
>> default. I see that as an indication that it is too early to drop Java 7.
>>
>> Looking at the timeline, the JDK gets a major new version roughly every 3
>> years. We dropped Java 6 support one year ago, so from a timeline point of
>> view we would be very aggressive here if we were to drop Java 7 support in
>> Spark 2.0.
>>
>> Note that not dropping Java 7 support now doesn't mean we have to support
>> Java 7 throughout Spark 2.x. We dropped Java 6 support in Spark 1.5, even
>> though Spark 1.0 started with Java 6.
>>
>> In terms of testing, Josh has actually improved our test infra so now we
>> would run the Java 8 tests: https://github.com/apache/spark/pull/12073
>>
>>
>>
>>
>> On Thu, Mar 24, 2016 at 8:51 PM, Liwei Lin <lwl...@gmail.com> wrote:
>>
>>> The arguments are really convincing; the new Dataset API and the performance
>>>
>>> improvements are exciting, so I'm personally +1 on moving onto Java 8.
>>>
>>>
>>>
>>> However, I'm afraid Tencent is one of "the organizations stuck with
>>> Java7"
>>>
>>> -- our IT Infra division wouldn't upgrade to Java7 until Java8 is out,
>>> and
>>>
>>> wouldn't upgrade to Java8 until Java9 is out.
>>>
>>>
>>> So:
>>>
>>> (non-binding) +1 on dropping scala 2.10 support
>>>
>>> (non-binding)  -1 on dropping Java 7 support
>>>
>>>   * as long as we figure out a practical way to run Spark with
>>> JDK8 on JDK7 clusters, this -1 would then definitely be +1
>>>
>>>
>>> Thanks !
>>>
>>> On Fri, Mar 25, 2016 at 10:28 AM, Koert Kuipers <ko...@tresata.com>
>>> wrote:
>>>
>>>> i think that logic is reasonable, but then the same should also apply
>>>> to scala 2.10, which is also unmaintained/unsupported at this point
>>>> (basically has been since march 2015 except for one hotfix due to a license
>>>> incompatibility)
>>>>
>>>> who wants to support scala 2.10 three years after they did the last
>>>> maintenance release?
>>>>
>>>>
>>>> On Thu, Mar 24, 2016 at 9:59 PM, Mridul Muralidharan <mri...@gmail.com>
>>>> wrote:
>>>>
>>>>> Removing compatibility (with the jdk, etc.) can be done with a major
>>>>> release - given that 7 was EOLed a while back and is now unsupported,
>>>>> we have to decide if we drop support for it in 2.0 or 3.0 (2+ years from
>>>>> now).
>>>>>
>>>>> Given the functionality & performance benefits of going to jdk8, the
>>>>> future enhancements relevant in the 2.x timeframe (scala, dependencies)
>>>>> which require it, and the simplicity wrt code, test & support, it looks
>>>>> like a good checkpoint to drop jdk7 support.
>>>&g