spark support on windows

2017-01-16 Thread assaf.mendelson
Hi, In the documentation it says spark is supported on windows. The problem, however, is that the documentation for running on windows is lacking. There are sources (such as https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/spark-tips-and-tricks-running-spark-windows.html and

implement UDF/UDAF supporting whole stage codegen

2016-09-07 Thread assaf.mendelson
Hi, I want to write a UDF/UDAF which provides native processing performance. Currently, when creating a UDF/UDAF in the normal manner, performance takes a hit because it breaks optimizations. For a simple example I wanted to create a UDF which tests whether the value is smaller than 10. I tried
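[A minimal sketch of the distinction being described, assuming a spark-shell session where `spark` is in scope (names are illustrative): the same predicate written as a Scala UDF and as a native column expression. The UDF is opaque to Catalyst, so it cannot take part in whole-stage codegen the way the built-in expression can.

    import org.apache.spark.sql.functions.{col, udf}

    // Opaque to the optimizer: whole-stage codegen breaks at this point.
    val lessThan10 = udf((x: Long) => x < 10)
    val df = spark.range(1000000L)

    df.filter(lessThan10(col("id"))).count() // UDF path
    df.filter(col("id") < 10).count()        // native expression path

Comparing the explain() output of the two filtered frames shows how the physical plans differ.]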

UDF and native functions performance

2016-09-12 Thread assaf.mendelson
I am trying to create UDFs with improved performance. So I decided to compare several ways of doing it. In general I created a dataframe using range with 50M elements, cached it and counted it to manifest it. I then implemented a simple predicate (x<10) in 4 different ways, counted the
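[A sketch of the benchmark setup described, assuming a spark-shell session and that the four variants were along these lines (the timing helper and names are illustrative):

    import org.apache.spark.sql.functions.{col, udf}

    val lessThan10 = udf((x: Long) => x < 10)
    val df = spark.range(50000000L).cache()
    df.count() // materialize the cache

    def time[T](label: String)(f: => T): T = {
      val start = System.nanoTime()
      val result = f
      println(s"$label: ${(System.nanoTime() - start) / 1e9}s")
      result
    }

    time("native expression") { df.filter(col("id") < 10).count() }
    time("scala udf")         { df.filter(lessThan10(col("id"))).count() }
    time("dataset lambda")    { df.filter(_ < 10).count() }
    time("rdd")               { df.rdd.filter(_ < 10).count() }

Caching and counting up front keeps data materialization out of the measured times.]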

Test fails when compiling spark with tests

2016-09-11 Thread assaf.mendelson
Hi, I am trying to set up a spark development environment. I forked the spark git project and cloned the fork. I then checked out the branch-2.0 tag (which I assume is the released source code). I then compiled spark twice. The first time using: mvn -Pyarn -Phadoop-2.6 -Dhadoop.version=2.6.0 -DskipTests

RE: UDF and native functions performance

2016-09-12 Thread assaf.mendelson
Codegen(). // maropu On Mon, Sep 12, 2016 at 7:43 PM, assaf.mendelson <[hidden email]> wrote: I am trying to create UDFs with improved performance. So I decided to compare several ways of doing it. In general I created a dataframe using range with 50M elements, cach

RE: Memory usage for spark types

2016-09-20 Thread assaf.mendelson
, Sep 18, 2016 at 9:06 AM, assaf.mendelson <[hidden email]> wrote: Hi, I am trying to understand how spark types are kept in memory and accessed. I tried to look at the code at the definition of MapType and ArrayType for example and I can’t seem to find the relevant code for its

https://issues.apache.org/jira/browse/SPARK-17691

2016-09-27 Thread assaf.mendelson
Hi, I wanted to try to implement https://issues.apache.org/jira/browse/SPARK-17691. So I started by looking at the implementation of collect_list. My idea was to do the same as they do, but when adding a new element, if there are already more than the threshold, remove one instead. The problem with

RE: Watermarking in Structured Streaming to drop late data

2016-10-27 Thread assaf.mendelson
Hi, Should comments come here or in the JIRA? Anyway, I am a little confused on the need to expose this as an API to begin with. Let’s consider for a second the most basic behavior: we have some input stream and we want to aggregate a sum over a time window. This means that the window we should be
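[For reference, the "most basic behavior" referred to above is the windowed aggregation structured streaming already exposes; a minimal sketch, assuming `events` is a streaming DataFrame with "timestamp" and "value" columns (names are illustrative):

    import org.apache.spark.sql.functions.{col, sum, window}

    // sum of "value" per 10-minute event-time window
    val windowedSums = events
      .groupBy(window(col("timestamp"), "10 minutes"))
      .agg(sum(col("value")))

The thread is about when state for old windows can be dropped, which this basic form leaves unbounded.]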

RE: Watermarking in Structured Streaming to drop late data

2016-10-27 Thread assaf.mendelson
https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-101 https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-102 Kind Regards 2016-10-27 9:46 GMT+03:00 assaf.mendelson <[hidden email]>: Hi, Should comments come here or in the JIRA? Anyway, I am a little confused on the need to expose this as an API to begin with.

RE: Handling questions in the mailing lists

2016-11-08 Thread assaf.mendelson
the code. I >> think that given the scale of the list, it's not wrong to assert that this >> is sort of a prerequisite for asking thousands of people to answer one's >> question. But we can't enforce that. >> >> >> >> The situation will get better to the extent p

RE: Handling questions in the mailing lists

2016-11-09 Thread assaf.mendelson
Community Mailing Lists / StackOverflow Changes <https://docs.google.com/document/d/1N0pKatcM15cqBPqFWCqIy6jdgNzIoacZlYDCjufBh2s/edit#heading=h.xshc1bv4sn3p> has been updated to include suggested tags. WDYT? On Tue, Nov 8, 2016 at 11:02 PM assaf.mendelson <[hidden email]> wrote: I like

RE: StructuredStreaming status

2016-10-19 Thread assaf.mendelson
There is one issue I was thinking of. If I understand correctly, structured streaming basically groups by a bucket for time in sliding window (of the step). My problem is that in some cases (e.g. distinct count and any other case where the buffer is relatively large) this would mean copying the

RE: StructuredStreaming status

2016-10-20 Thread assaf.mendelson
My thoughts were of handling just the “current” state of the sliding window (i.e. the “last” window). The idea is that at least in cases which I encountered, the sliding window is used to “forget” irrelevant information and therefore when a step goes out of date for the “current” window it

Converting spark types and standard scala types

2016-10-25 Thread assaf.mendelson
Hi, I am trying to write a new aggregate function (https://issues.apache.org/jira/browse/SPARK-17691) and I wanted it to support all ordered types. I have several issues though: 1. How to convert the type of the child expression to a Scala standard type (e.g. I need an Array[Int] for
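[For the array case specifically, a hedged sketch of the internal route (these are Spark-internal classes that may change between versions): within Catalyst an array value is represented as ArrayData, which exposes typed extractors such as toIntArray.

    import org.apache.spark.sql.catalyst.util.{ArrayData, GenericArrayData}

    // Illustrative construction of the internal representation an
    // expression's eval() would receive for an array-typed child.
    val data: ArrayData = new GenericArrayData(Array[Any](1, 2, 3))
    val ints: Array[Int] = data.toIntArray() // assumes IntegerType elements
]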

separate spark and hive

2016-11-14 Thread assaf.mendelson
Hi, Today, we basically force people to use hive if they want to get the full use of spark SQL. When doing the default installation this means that a derby.log and metastore_db directory are created where we run from. The problem with this is that if we run multiple scripts from the same working
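[A hedged sketch of one workaround for the working-directory collision (the Derby/warehouse property names are the standard ones, but whether they can be passed through the session builder like this should be verified for your Spark version; paths are illustrative):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("isolated-metastore")
      // keep the warehouse out of the current working directory
      .config("spark.sql.warehouse.dir", "/tmp/myapp/warehouse")
      // point the embedded Derby metastore at a per-application location
      .config("javax.jdo.option.ConnectionURL",
        "jdbc:derby:;databaseName=/tmp/myapp/metastore_db;create=true")
      .getOrCreate()
]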

RE: statistics collection and propagation for cost-based optimizer

2016-11-14 Thread assaf.mendelson
I am not sure I understand when the statistics would be calculated. Would they always be calculated or just when analyze is called? Would it be possible to save analysis results as part of dataframe saving (e.g. when writing it to parquet) or do we have to have a consistent hive installation?

RE: [ANNOUNCE] Apache Spark 2.0.2

2016-11-14 Thread assaf.mendelson
While you can download spark 2.0.2, the description is still spark 2.0.1: Our latest stable version is Apache Spark 2.0.1, released on Oct 3, 2016 (release notes) (git tag)

RE: separate spark and hive

2016-11-15 Thread assaf.mendelson
make sense to do, but it just seems a lot of work to do right now and it'd take a toll on interoperability. If you don't need persistent catalog, you can just run Spark without Hive mode, can't you? On Mon, Nov 14, 2016 at 11:23 PM, assaf.mendelson <[hidden email]> wrote: Hi, Today, we

RE: how does isDistinct work on expressions

2016-11-13 Thread assaf.mendelson
gs to where. Are you creating a new UDAF? What have you done already? GitHub perhaps? Regards, Jacek Laskowski https://medium.com/@jaceklaskowski/ Mastering Apache Spark 2.0 https://bit.ly/mastering-apache-spark Follow me at https://twitter.com/jaceklaskowski On Sun, Nov 13, 2016 at 12:03

RE: separate spark and hive

2016-11-15 Thread assaf.mendelson
If you don't need persistent catalog, you can just run Spark without Hive mode, can't you? On Mon, Nov 14, 2016 at 11:23 PM, assaf.mendelson <[hidden email]> wrote: Hi, Today, we basically force people to use hive if they want to get the full use of spark SQL. When doing the default i

RE: Handling questions in the mailing lists

2016-11-23 Thread assaf.mendelson
community.html page. (We're going to migrate the wiki real soon now anyway) I updated the /community.html page per this thread too. PR: https://github.com/apache/spark-website/pull/16 On Tue, Nov 15, 2016 at 2:49 PM assaf.mendelson <[hidden email]> wrote: Should probably also update the h

Aggregating over sorted data

2016-11-23 Thread assaf.mendelson
Hi, An issue I have encountered frequently is the need to look at data in an ordered manner per key. A common way of doing this can be seen in the classic map reduce as the shuffle stage provides sorted data per key and one can therefore do a lot with that. It is of course relatively easy to
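[A sketch of the secondary-sort-style workaround available through the current API (`df`, "key" and "time" are illustrative): co-locate each key in a single partition and sort within partitions, after which each partition can be scanned in order (e.g. with mapPartitions) with rows grouped by key and ordered within it.

    import org.apache.spark.sql.functions.col

    val sortedPerKey = df
      .repartition(col("key"))                       // all rows of a key land in one partition
      .sortWithinPartitions(col("key"), col("time")) // ...and arrive ordered per key
]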

RE: Handling questions in the mailing lists

2016-11-24 Thread assaf.mendelson
critical mass after 2 years. It would just fracture the discussion to yet another place. On Thu, Nov 24, 2016 at 6:52 AM assaf.mendelson <[hidden email]> wrote: Sorry to reawaken this, but I just noticed it is possible to propose new topic specific sites (http://area51.stackexchange.c

structured streaming and window functions

2016-11-17 Thread assaf.mendelson
Hi, I have been trying to figure out how structured streaming handles window functions efficiently. The portion I understand is that whenever new data arrives, it is grouped by the time and the aggregated data is added to the state. However, unlike operations like sum etc. window functions need

RE: structured streaming and window functions

2016-11-17 Thread assaf.mendelson
it sounds, because you need to maintain state in a fault tolerant way and you need to have some eviction policy (watermarks for instance) for aggregation buffers to prevent the state store from reaching an infinite size. On Thu, Nov 17, 2016 at 12:19 AM, assaf.mendelson <[hidden email]> wro

RE: structured streaming and window functions

2016-11-17 Thread assaf.mendelson
On each session, you want to count the number of failed login events. If so, this is tracked by https://issues.apache.org/jira/browse/SPARK-10816 (didn't start yet) Ofir Manor Co-Founder & CTO | Equalum Mobile: +972-54-7801286 | Email: [hidden email] On Thu, Nov 17, 2016 at 2:52 PM, assaf.mende

RE: Handling questions in the mailing lists

2016-11-15 Thread assaf.mendelson
contributors besides just PMC/committers). On Wed, Nov 9, 2016 at 2:18 AM, assaf.mendelson <[hidden email]> wrote: I was just wondering, before we move on to SO. Do we have enough contributors with enough reputation to manage things in SO? We would need contributors with enough reputation to have r

how does isDistinct work on expressions

2016-11-13 Thread assaf.mendelson
Hi, I am trying to understand how aggregate functions are implemented internally. I see that the expression is wrapped using toAggregateExpression using isDistinct. I can't figure out where the code that makes the data distinct is located. I am trying to figure out how the input data is

Handling questions in the mailing lists

2016-11-02 Thread assaf.mendelson
Hi, I know this is a little off topic but I wanted to raise an issue about handling questions in the mailing list (this is true both for the user mailing list and the dev but since there are other options such as stack overflow for user questions, this is more problematic in dev). Let's say I

Using SPARK_WORKER_INSTANCES and SPARK-15781

2016-10-26 Thread assaf.mendelson
As of applying SPARK-15781, the documentation of SPARK_WORKER_INSTANCES has been removed. This was due to a warning in spark-submit which suggested: WARN SparkConf: SPARK_WORKER_INSTANCES was detected (set to '4'). This is deprecated in Spark 1.0+. Please instead use: - ./spark-submit with

RE: Python Spark Improvements (forked from Spark Improvement Proposals)

2016-10-13 Thread assaf.mendelson
Hi, We are actually using pyspark heavily. I agree with all of your points, for me I see the following as the main hurdles: 1. Pyspark does not have support for UDAF. We have had multiple needs for UDAF and needed to go to java/scala to support these. Having python UDAF would have made
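[For reference, this is the kind of Scala detour described above; a minimal sketch of a UDAF against the Spark 2.x UserDefinedAggregateFunction API (the aggregate itself, a long sum, is illustrative):

    import org.apache.spark.sql.Row
    import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
    import org.apache.spark.sql.types._

    class LongSum extends UserDefinedAggregateFunction {
      override def inputSchema: StructType = StructType(StructField("value", LongType) :: Nil)
      override def bufferSchema: StructType = StructType(StructField("sum", LongType) :: Nil)
      override def dataType: DataType = LongType
      override def deterministic: Boolean = true
      override def initialize(buffer: MutableAggregationBuffer): Unit = buffer(0) = 0L
      override def update(buffer: MutableAggregationBuffer, input: Row): Unit =
        if (!input.isNullAt(0)) buffer(0) = buffer.getLong(0) + input.getLong(0)
      override def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit =
        buffer1(0) = buffer1.getLong(0) + buffer2.getLong(0)
      override def evaluate(buffer: Row): Any = buffer.getLong(0)
    }

It is then invoked as df.agg(new LongSum()(col("value"))); this registration-and-buffer machinery is the piece pyspark lacks in 2.x.]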

RE: Official Stance on Not Using Spark Submit

2016-10-13 Thread assaf.mendelson
I actually do not use spark submit for several use cases; all of them currently revolve around running it directly with python. One of the most important ones is developing in pycharm. Basically I am using pycharm and configure it with a remote interpreter which runs on the server while my

RE: [SPARK-17845] [SQL][PYTHON] More self-evident window function frame boundary API

2016-11-30 Thread assaf.mendelson
I may be mistaken but if I remember correctly spark behaves differently when it is bounded in the past and when it is not. Specifically I seem to recall a fix which made sure that when there is no lower bound then the aggregation is done one by one instead of doing the whole range for each

RE: repeated unioning of dataframes takes worse than O(N^2) time

2016-12-29 Thread assaf.mendelson
recursive structure in a similar manner to that described here http://stackoverflow.com/q/34461804 You can try something like this http://stackoverflow.com/a/37612978 but there is of course an overhead of conversion between Dataset and RDD. On 12/29/2016 06:21 PM, assaf.mendelson wrote: Hi, I have

repeated unioning of dataframes takes worse than O(N^2) time

2016-12-29 Thread assaf.mendelson
Hi, I have been playing around with doing union between a large number of dataframes and saw that the performance of the actual union (not the action) is worse than O(N^2). Since a union basically defines a lineage (i.e. current + union with the other as a child) this should be almost
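[A sketch of the pattern and of one common mitigation, assuming a spark-shell session (frames and counts are illustrative): folding unions one at a time yields a logical plan of depth N that is re-analyzed at every step, whereas a pairwise reduction keeps the plan depth at O(log N).

    import org.apache.spark.sql.DataFrame

    val frames: Seq[DataFrame] = (1 to 100).map(_ => spark.range(10).toDF("id"))

    val linear = frames.reduce(_ union _) // plan depth grows linearly with N

    // pairwise/balanced reduction (assumes non-empty input): depth O(log N)
    def balancedUnion(dfs: Seq[DataFrame]): DataFrame =
      if (dfs.length == 1) dfs.head
      else balancedUnion(dfs.grouped(2).map(_.reduce(_ union _)).toSeq)
]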

RE: Shuffle intermediate results not being cached

2016-12-27 Thread assaf.mendelson
eration. You just need to get the summary and union it with the new dataframe to compute the newer aggregation summary in the next iteration. It is more similar to the streaming case; I don't think you can/should recompute all the data since the beginning of a stream. assaf.mendelson wrote The reason

RE: Shuffle intermediate results not being cached

2016-12-26 Thread assaf.mendelson
The reason I thought some operations would be reused is the fact that spark automatically caches shuffle data, which means the partial aggregation for previous dataframes would be saved. Unfortunately, as Mark Hamstra explained, this is not the case because this is considered a new RDD and

RE: Aggregating over sorted data

2016-12-22 Thread assaf.mendelson
It seems that this aggregation is for dataset operations only. I would have hoped to be able to do dataframe aggregation. Something along the line of: sort_df(df).agg(my_agg_func) In any case, note that this kind of sorting is less efficient than the sorting done in window functions for

Shuffle intermediate results not being cached

2016-12-26 Thread assaf.mendelson
Hi, Sorry to be bothering everyone on the holidays but I have found what may be a bug. I am doing a "manual" streaming (see http://stackoverflow.com/questions/41266956/apache-spark-streaming-performance for the specific code) where I essentially read an additional dataframe each time from

Possible bug: inconsistent timestamp behavior

2017-08-15 Thread assaf.mendelson
Hi all, I encountered weird behavior for timestamps. It seems that when using lit to add one to a column, the timestamp goes from a milliseconds representation to a seconds representation: scala> spark.range(1).withColumn("a", lit(new java.sql.Timestamp(148550335L)).cast("long")).show()

RE: [SS] Why does ConsoleSink's addBatch convert input DataFrame to show it?

2017-07-07 Thread assaf.mendelson
I actually asked the same thing a couple of weeks ago. Apparently, when you create a structured streaming plan, it is different than the batch plan and is fixed in order to properly aggregate. If you perform most operations on the dataframe it will recalculate the plan as a batch plan and will

RE: [VOTE] Apache Spark 2.2.0 (RC5)

2017-06-26 Thread assaf.mendelson
Not a show stopper, however, I was looking at the structured streaming programming guide and under arbitrary stateful operations (https://people.apache.org/~pwendell/spark-releases/spark-2.2.0-rc5-docs/structured-streaming-programming-guide.html#arbitrary-stateful-operations) the suggestion is

RE: [PYTHON] PySpark typing hints

2017-05-23 Thread assaf.mendelson
Actually there is, at least for pycharm. I actually opened a jira on it (https://issues.apache.org/jira/browse/SPARK-17333). It describes two ways of doing it (I also made a github stub at: https://github.com/assafmendelson/ExamplePysparkAnnotation). Unfortunately, I never found the time to

Re: Spark data source resiliency

2018-07-03 Thread assaf.mendelson
That is what I expected; however, I did a very simple test (using println just to see when the exception is triggered in the iterator) using local master and I saw it fail once and cause the entire operation to fail. Is this something which may be unique to local master (or some default

Spark data source resiliency

2018-07-03 Thread assaf.mendelson
Hi All, I have implemented a data source V2 which integrates with an internal system and I need to make it resilient to errors in the internal data source. The issue is that currently, if there is an exception in the data reader, the exception seems to fail the entire task. I would prefer instead
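[A hedged sketch of one way to get the preferred behavior inside a custom reader (illustrative, and not tied to any specific DataSourceV2 interface): wrap whatever iterator feeds rows to Spark so that per-record failures are recorded and skipped instead of escaping and failing the task.

    // Skips records whose read throws, instead of failing the task.
    def resilient[T](underlying: Iterator[T], onError: Throwable => Unit): Iterator[T] =
      new Iterator[T] {
        private var pending: Option[T] = None
        private def advance(): Unit =
          while (pending.isEmpty && underlying.hasNext) {
            try pending = Some(underlying.next())
            catch { case e: Exception => onError(e) } // record and skip
          }
        def hasNext: Boolean = { advance(); pending.isDefined }
        def next(): T = {
          advance()
          val out = pending.getOrElse(throw new NoSuchElementException("next on empty iterator"))
          pending = None
          out
        }
      }
]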

Re: Spark data source resiliency

2018-07-03 Thread assaf.mendelson
You are correct, this solved it. Thanks

Data source V2

2018-07-31 Thread assaf.mendelson
Hi all, I am currently in the middle of developing a new data source (for an internal tool) using data source V2. I noticed that SPARK-24882 is planned for 2.4 and includes interface changes. I was wondering if those are planned in addition

Why is SQLImplicits an abstract class rather than a trait?

2018-08-05 Thread assaf.mendelson
Hi all, I have been playing a bit with SQLImplicits and noticed that it is an abstract class. I was wondering why that is, given that it has no constructor. Because it is an abstract class, a test helper that extends it cannot itself be a trait. Consider the following: trait
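[A minimal sketch of the friction described (the helper and its config are illustrative): in Scala a trait may extend a class, but it can then only be mixed into classes whose superclass is compatible, so a helper based on SQLImplicits effectively has to be an abstract class.

    import org.apache.spark.sql.{SQLContext, SQLImplicits, SparkSession}

    // Works, but forces every consumer to extend this class...
    abstract class SparkTestHelper extends SQLImplicits {
      lazy val spark: SparkSession =
        SparkSession.builder().master("local[*]").getOrCreate() // illustrative config
      override protected def _sqlContext: SQLContext = spark.sqlContext
    }

    // ...whereas a trait version restricts what it can be mixed into:
    // trait SparkHelperTrait extends SQLImplicits { ... }
    // class MyTest extends SomeTestBase with SparkHelperTrait
    //   => "illegal inheritance" unless SomeTestBase extends SQLImplicits
]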

Re: Why is SQLImplicits an abstract class rather than a trait?

2018-08-06 Thread assaf.mendelson
The import will work for the trait but not for anyone implementing the trait. As for not having a master, it was just an example, the full example contains some configurations. Thanks, Assaf

DataSourceV2 documentation & tutorial

2018-10-08 Thread assaf.mendelson
Hi all, I have been working on a legacy datasource integration with data source V2 for the last couple of weeks, including upgrading it to the Spark 2.4.0 RC. During this process I wrote a tutorial with explanations on how to create a new datasource (it can be found in

Re: Data source V2 in spark 2.4.0

2018-10-04 Thread assaf.mendelson
Thanks for the info. I have been converting an internal data source to V2 and am now preparing it for 2.4.0. I have a couple of suggestions from my experience so far. First I believe we are missing documentation on this. I am currently writing an internal tutorial based on what I am learning, I

Docker image to build Spark/Spark doc

2018-10-10 Thread assaf.mendelson
Hi all, I was wondering if there was a docker image to build spark and/or the spark documentation. The idea would be that I would start the docker image, supplying the directory with my code and a target directory, and it would simply build everything (maybe with some options). Any chance there is

Possible bug in DatasourceV2

2018-10-11 Thread assaf.mendelson
Hi, I created a datasource writer WITHOUT a reader. When I do, I get an exception: org.apache.spark.sql.AnalysisException: Data source is not readable: DefaultSource The reason for this is that when save is called, inside the source match to WriterSupport we have the following code: val source

Data source V2 in spark 2.4.0

2018-10-01 Thread assaf.mendelson
Hi all, I understood from previous threads that the Data source V2 API will see some changes in spark 2.4.0, however, I can't seem to find what these changes are. Is there some documentation which summarizes the changes? The only mention I seem to find is this pull request: