RE: using an alternative slf4j implementation

2017-02-06 Thread Mendelson, Assaf
of forcing logback as the binding? -Original Message- From: Jacek Laskowski [mailto:ja...@japila.pl] Sent: Monday, February 06, 2017 10:46 AM To: Mendelson, Assaf Cc: user Subject: Re: using an alternative slf4j implementation Hi, Sounds like a quite involved development for me. I can't

using an alternative slf4j implementation

2017-02-05 Thread Mendelson, Assaf
Hi, Spark seems to explicitly use log4j. This means that if I use an alternative backend for my application (e.g. ch.qos.logback) I have a conflict. Sure I can exclude logback but that means my application cannot use our internal tools. Is there a way to use logback as a backend logging while

RE: using an alternative slf4j implementation

2017-02-06 Thread Mendelson, Assaf
spark’s logs to log4j and my logs to logback or send everything to logback. Assaf. From: Jacek Laskowski [mailto:ja...@japila.pl] Sent: Monday, February 06, 2017 12:47 AM To: Mendelson, Assaf Cc: user Subject: Re: using an alternative slf4j implementation Hi, Shading conflicting dependencies

Weird performance on Windows laptop

2017-01-24 Thread Mendelson, Assaf
Hi, I created a simple synthetic test which does a sample calculation twice, each time with different partitioning: def time[R](block: => R): Long = { val t0 = System.currentTimeMillis() block// call-by-name val t1 = System.currentTimeMillis() t1 - t0 } val base_df =
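
Reusing the time helper quoted in the snippet above, the benchmark presumably looked something like the following; the DataFrame contents and the two partition counts here are assumptions, not the original test:

    import org.apache.spark.sql.functions.sum

    // assumed synthetic data; the original base_df definition is truncated above
    val base_df = spark.range(10000000L).toDF("id")
    val coarse = time { base_df.repartition(4).agg(sum("id")).collect() }
    val fine   = time { base_df.repartition(200).agg(sum("id")).collect() }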

RE: forcing dataframe groupby partitioning

2017-01-29 Thread Mendelson, Assaf
Could you explain why this would work? Assaf. From: Haviv, Daniel [mailto:dha...@amazon.com] Sent: Sunday, January 29, 2017 7:09 PM To: Mendelson, Assaf Cc: user@spark.apache.org Subject: Re: forcing dataframe groupby partitioning If there's no built in local groupBy, You could do something like

forcing dataframe groupby partitioning

2017-01-29 Thread Mendelson, Assaf
Hi, Consider the following example: df.groupby(C1,C2).agg(some agg).groupby(C1).agg(some more agg) The default way spark would behave would be to shuffle according to a combination of C1 and C2 and then shuffle again by C1 only. This behavior makes sense when one uses C2 to salt C1 for

RE: is dataframe thread safe?

2017-02-12 Thread Mendelson, Assaf
To: Mendelson, Assaf Cc: user Subject: Re: is dataframe thread safe? I am not sure what you are trying to achieve here. Spark is taking care of executing the transformations in a distributed fashion. This means you must not use threads - it does not make sense. Hence, you do not find documentation about

RE: is dataframe thread safe?

2017-02-12 Thread Mendelson, Assaf
for…) Assaf From: Jörn Franke [mailto:jornfra...@gmail.com] Sent: Sunday, February 12, 2017 2:46 PM To: Sean Owen Cc: Mendelson, Assaf; user Subject: Re: is dataframe thread safe? I did not doubt that the submission of several jobs of one application makes sense. However, he wants to create threads

is dataframe thread safe?

2017-02-12 Thread Mendelson, Assaf
Hi, I was wondering if dataframe is considered thread safe. I know the spark session and spark context are thread safe (and actually have tools to manage jobs from different threads) but the question is, can I use the same dataframe in both threads. The idea would be to create a dataframe in

fault tolerant dataframe write with overwrite

2017-02-14 Thread Mendelson, Assaf
Hi, I have a case where I have an iterative process which overwrites the results of a previous iteration. Every iteration I need to write a dataframe with the results. The problem is that when I write, if I simply overwrite the results of the previous iteration, this is not fault tolerant. i.e.

RE: fault tolerant dataframe write with overwrite

2017-02-14 Thread Mendelson, Assaf
the relevant iteration. Basically I would have liked to see something like saving normally and the original data would not be removed until a successful write. Assaf. From: Jörn Franke [mailto:jornfra...@gmail.com] Sent: Tuesday, February 14, 2017 12:54 PM To: Mendelson, Assaf Cc: user Subject: Re

RE: [Spark Launcher] How to launch parallel jobs?

2017-02-14 Thread Mendelson, Assaf
You should also check your memory usage. Let’s say for example you have 16 cores and 8 GB. And that you use 4 executors with 1 core each. When you use an executor, spark reserves it from yarn and yarn allocates the number of cores (e.g. 1 in our case) and the memory. The memory is actually more
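
As a rough illustration of the allocation being described (the values mirror the example in the snippet and are not a recommendation; actual sizing depends on the cluster):

    // 4 executors with 1 core each; YARN allocates the heap plus an overhead on top of it
    val conf = new org.apache.spark.SparkConf()
      .set("spark.executor.instances", "4")
      .set("spark.executor.cores", "1")
      .set("spark.executor.memory", "1g")                 // heap requested per executor
      .set("spark.yarn.executor.memoryOverhead", "384")   // extra memory YARN reserves per executor (MB)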

Updating variable in foreachRDD

2017-02-09 Thread Mendelson, Assaf
Hi, I was wondering how foreachRDD would run. Specifically, let's say I do something like (nothing real, just for understanding): var df = ??? var counter = 0 dstream.foreachRDD { rdd: RDD[Long] => { val df2 = rdd.toDF(...) df = df.union(df2) counter += 1 if

RE: fault tolerant dataframe write with overwrite

2017-02-14 Thread Mendelson, Assaf
Loughran [mailto:ste...@hortonworks.com] Sent: Tuesday, February 14, 2017 3:25 PM To: Mendelson, Assaf Cc: Jörn Franke; user Subject: Re: fault tolerant dataframe write with overwrite On 14 Feb 2017, at 11:12, Mendelson, Assaf <assaf.mendel...@rsa.com<mailto:assaf.mendel...@rsa.com>> wr

handling dependency conflicts with spark

2017-02-27 Thread Mendelson, Assaf
Hi, I have a project which uses Jackson 2.8.5. Spark on the other hand seems to be using 2.6.5. I am using maven to compile. My original solution to the problem has been to set spark dependencies with the "provided" scope and use maven shade plugin to shade Jackson in my compilation. The

RE: classpath conflict with spark internal libraries and the spark shell.

2016-09-11 Thread Mendelson, Assaf
You can try shading the jar. Look at maven shade plugin From: Benyi Wang [mailto:bewang.t...@gmail.com] Sent: Saturday, September 10, 2016 1:35 AM To: Colin Kincaid Williams Cc: user@spark.apache.org Subject: Re: classpath conflict with spark internal libraries and

RE: Selecting the top 100 records per group by?

2016-09-11 Thread Mendelson, Assaf
You can also create a custom aggregation function. It might provide better performance than dense_rank. Consider the following example to collect everything as list: class CollectListFunction[T](val colType: DataType) extends UserDefinedAggregateFunction { def inputSchema: StructType =
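
The quoted class definition is cut off above; a minimal self-contained sketch of such a collect-to-list UDAF (not the original implementation, which is not shown in full here) could look like this:

    import org.apache.spark.sql.Row
    import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
    import org.apache.spark.sql.types._

    // Collects every non-null input value of the given type into an array
    class CollectListFunction(colType: DataType) extends UserDefinedAggregateFunction {
      def inputSchema: StructType = new StructType().add("value", colType)
      def bufferSchema: StructType = new StructType().add("list", ArrayType(colType))
      def dataType: DataType = ArrayType(colType)
      def deterministic: Boolean = true
      def initialize(buffer: MutableAggregationBuffer): Unit = {
        buffer(0) = Seq.empty[Any]
      }
      def update(buffer: MutableAggregationBuffer, input: Row): Unit = {
        if (!input.isNullAt(0)) buffer(0) = buffer.getSeq[Any](0) :+ input.get(0)
      }
      def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit = {
        buffer1(0) = buffer1.getSeq[Any](0) ++ buffer2.getSeq[Any](0)
      }
      def evaluate(buffer: Row): Any = buffer.getSeq[Any](0)
    }

It would then be used like any other aggregate, e.g. df.groupBy("key").agg(new CollectListFunction(StringType)(col("value"))), where the column names are placeholders.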

RE: add jars like spark-csv to ipython notebook with pyspakr

2016-09-11 Thread Mendelson, Assaf
In my case I do the following: export PYSPARK_DRIVER_PYTHON_OPTS="notebook --no-browser" pyspark --jars myjar.jar --driver-class-path myjar.jar hope this helps… From: pseudo oduesp [mailto:pseudo20...@gmail.com] Sent: Friday, September 09, 2016 3:55 PM To:

RE: building runnable distribution from source

2016-09-29 Thread Mendelson, Assaf
] Sent: Thursday, September 29, 2016 1:20 PM To: Mendelson, Assaf Cc: user@spark.apache.org Subject: Re: building runnable distribution from source Check that your R is properly installed: >Cannot find 'R_HOME'. Please specify 'R_HOME' or make sure R is properly >installed. On Thu, 2016

RE: udf of aggregation in pyspark dataframe ?

2016-09-30 Thread Mendelson, Assaf
I may be missing something here, but it seems to me you can do it like this: df.groupBy('a').agg(collect_list('c').alias("a"), collect_list('d').alias("b")).withColumn('named_list', my_zip(F.col("a"), F.col("b"))) without needing to write a new aggregation function -Original Message- From:

RE: Spark security

2016-10-27 Thread Mendelson, Assaf
Anyone can assist with this? From: Mendelson, Assaf [mailto:assaf.mendel...@rsa.com] Sent: Thursday, October 13, 2016 3:41 PM To: user@spark.apache.org Subject: Spark security Hi, We have a spark cluster and we wanted to add some security for it. I was looking at the documentation (in http

RE: Re:RE: how to merge dataframe write output files

2016-11-10 Thread Mendelson, Assaf
As people stated, when you coalesce to 1 partition then basically you lose all parallelism; however, you can coalesce to a different value. If for example you coalesce to 20 then you can parallelize up to 20 different tasks. You have a total of 4 executors, with 2 cores each. This means that
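
For example (the output path is a placeholder):

    // reduce to 20 partitions instead of 1 to keep some parallelism when writing
    val fewer = df.coalesce(20)
    fewer.write.parquet("/path/to/output")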

RE: Aggregate UDF (UDAF) in Python

2016-10-18 Thread Mendelson, Assaf
df.agg(pythonudaf(df.id)).show() Lastly when you run, make sure to use both –jars and --driver-class-path with the jar created from scala to make sure it is available in all nodes. From: Tobi Bosede [mailto:ani.to...@gmail.com] Sent: Monday, October 17, 2016 10:15 PM To: Mendelson, Assaf Cc:

RE: pyspark dataframe codes for lead lag to column

2016-10-20 Thread Mendelson, Assaf
Depending on your usecase, you may want to take a look at window functions From: muhammet pakyürek [mailto:mpa...@hotmail.com] Sent: Thursday, October 20, 2016 11:36 AM To: user@spark.apache.org Subject: pyspark dataframe codes for lead lag to column is there pyspark dataframe codes for lead

RE: Aggregate UDF (UDAF) in Python

2016-10-17 Thread Mendelson, Assaf
A possible (bad) workaround would be to use the collect_list function. This will give you all the values in an array (list) and you can then create a UDF to do the aggregation yourself. This would be very slow and cost a lot of memory but it would work if your cluster can handle it. This is the

RE: Spark 2.0 - DataFrames vs Dataset performance

2016-10-24 Thread Mendelson, Assaf
Hi, I believe that the UDF is only a small part of the problem. You can easily test by doing a UDF for dataframe too. In my testing I saw that using datasets can be considerably slower than dataframe. I can make a guess as to why this happens. Basically what you are doing in a dataframe is

how does create dataframe from scala collection handle executor failure?

2016-11-22 Thread Mendelson, Assaf
Let's say I have a loop that reads some data from somewhere, stores it in a collection and creates a dataframe from it. Then an executor containing part of the dataframe dies. How does spark handle it? For example: val dfSeq = for { I <- 0 to 1000
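
A hypothetical reconstruction of the pattern being described (the actual loop body is truncated above):

    import spark.implicits._   // assumes `spark` is the active SparkSession

    // build a local collection on the driver, then turn it into a DataFrame
    val rows = for (i <- 0 to 1000) yield (i, s"record_$i")   // stand-in for "reads some data from somewhere"
    val df = rows.toDF("id", "value")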

RE: How does predicate push down really help?

2016-11-16 Thread Mendelson, Assaf
Actually, both of them translate to the same plan. When you do sql(“some code”) or filter, it doesn’t actually do the query. Instead it is translated to a plan (parsed plan) which transforms everything into standard spark expressions. Then spark analyzes it to fill in the blanks (what is users table
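
As an illustration (the table, view and column names are assumptions), both forms can be compared by printing their plans:

    import org.apache.spark.sql.functions.col

    usersDf.createOrReplaceTempView("users")
    spark.sql("SELECT name FROM users WHERE age > 21").explain(true)
    usersDf.filter(col("age") > 21).select("name").explain(true)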

RE: Nested UDFs

2016-11-16 Thread Mendelson, Assaf
regexp_replace is supposed to receive a column, so you don’t need to write a UDF for it. Instead try: test_data.select(regexp_replace(test_data.name, 'a', 'X')) You would need a UDF if you wanted to do something on the string value of a single row (e.g. return data + “bla”) Assaf. From:

RE: How does predicate push down really help?

2016-11-16 Thread Mendelson, Assaf
: kant kodali [mailto:kanth...@gmail.com] Sent: Thursday, November 17, 2016 9:50 AM To: Mendelson, Assaf Cc: user @spark Subject: Re: How does predicate push down really help? Hi Assaf, I am still trying to understand the merits of predicate push down from the examples you pointed out. Example 1

RE: Nested UDFs

2016-11-17 Thread Mendelson, Assaf
replacements so something like: def my_f(data): for match, repl in regexp_list: data = regexp_replace(match, repl, data) return data I could achieve my goal by multiple .select(regexp_replace()) lines, but one UDF would be nicer. -Perttu On Thu, 17 November 2016 at 9:42 Mendelson, Assaf

RE: DataFrame select non-existing column

2016-11-18 Thread Mendelson, Assaf
You can always add the columns to old dataframes giving them null (or some literal) as a preprocessing. -Original Message- From: Kristoffer Sjögren [mailto:sto...@gmail.com] Sent: Friday, November 18, 2016 4:32 PM To: user Subject: DataFrame select non-existing column Hi We have

RE: DataFrame select non-existing column

2016-11-18 Thread Mendelson, Assaf
In pyspark for example you would do something like: df.withColumn("newColName",pyspark.sql.functions.lit(None)) Assaf. -Original Message- From: Kristoffer Sjögren [mailto:sto...@gmail.com] Sent: Friday, November 18, 2016 9:19 PM To: Mendelson, Assaf Cc: user Subject: Re:

RE: DataFrame select non-existing column

2016-11-20 Thread Mendelson, Assaf
; dataFrame.select("pass.mobile"); [2] root |-- pass: struct (nullable = true) ||-- auction: struct (nullable = true) |||-- id: integer (nullable = true) ||-- geo: struct (nullable = true) |||-- postalCode: string (nullable = true) |-- pass.mobile: long (nul

RE: DataFrame select non-existing column

2016-11-20 Thread Mendelson, Assaf
== 'pass': found_pass = True for g in f.dataType: if g.name == 'mobile': found_mobile = True break Assaf -Original Message- From: Kristoffer Sjögren [mailto:sto...@gmail.com] Sent: Sunday, November 20, 2016 4:13 PM To: Mendelson, Assaf Cc

RE: Use a specific partition of dataframe

2016-11-03 Thread Mendelson, Assaf
There are a couple of tools you can use. Take a look at the various functions. Specifically, limit might be useful for you and sample/sampleBy functions can make your data smaller. Actually, when using CreateDataframe you can sample the data to begin with. Specifically working by partitions can
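
Hedged examples of the tools mentioned (the column name, fractions and seed are assumptions):

    val firstRows  = df.limit(1000)
    val sampled    = df.sample(withReplacement = false, fraction = 0.01)
    val stratified = df.stat.sampleBy("category", Map("A" -> 0.1, "B" -> 0.5), seed = 42L)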

RE: Quirk in how Spark DF handles JSON input records?

2016-11-03 Thread Mendelson, Assaf
I agree this can be a little annoying. The reason this is done this way is to enable cases where the json file is huge. To allow splitting it, a separator is needed and newline is the separator used (as is done in all text files in Hadoop and spark). I always wondered why support has not been

Spark security

2016-10-13 Thread Mendelson, Assaf
Hi, We have a spark cluster and we wanted to add some security for it. I was looking at the documentation (in http://spark.apache.org/docs/latest/security.html) and had some questions. 1. Do all executors listen by the same blockManager port? For example, in yarn there are multiple

RE: few basic questions on structured streaming

2016-12-08 Thread Mendelson, Assaf
For watermarking you can read this excellent article: part 1: https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-101, part2: https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-102. It explains more than just watermarking but it helped me understand a lot of the concepts

RE: filter RDD by variable

2016-12-08 Thread Mendelson, Assaf
Can you provide the sample code you are using? In general, RDD filter receives as an input a function. The function’s input is the single record in the RDD and the output is a Boolean whether or not to include it in the result. So you can create any function you want… Assaf. From: Soheila S.
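
A minimal illustration (the RDD contents and threshold are made up):

    // keep only records greater than a threshold held in a local variable
    val rdd = spark.sparkContext.parallelize(1 to 100)
    val threshold = 10
    val filtered = rdd.filter(record => record > threshold)
    filtered.collect()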

RE: How to find unique values after groupBy() in spark dataframe ?

2016-12-08 Thread Mendelson, Assaf
Groupby is not an actual result but a construct to allow defining aggregations. So you can do: import org.apache.spark.sql.{functions => func} val resDF = df.groupBy("client").agg(func.collect_set(df("Date"))) Note that collect_set can be a little heavy in terms

RE: About transformations

2016-12-09 Thread Mendelson, Assaf
This is a guess but I would bet that most of the time went into the loading of the data. The second time there are many places this could be cached (either by spark or even by the OS if you are reading from file). -Original Message- From: brccosta [mailto:brunocosta@gmail.com]

Creating schema from json representation

2016-12-04 Thread Mendelson, Assaf
Hi, I am trying to save a spark dataframe schema in scala. I can do df.schema.json to get the json string representation. Now I want to get the schema back from the json. However, it seems I need to parse the json string myself, get its fields object and generate the fields manually. Is there a

RE: Creating schema from json representation

2016-12-04 Thread Mendelson, Assaf
Answering my own question (for those who are interested): val schema = df.schema val jsonString = schema.json val backToSchema = DataType.fromJson(jsonString).asInstanceOf[StructType] From: Mendelson, Assaf [mailto:assaf.mendel...@rsa.com] Sent: Sunday, December 04, 2016 11:11 AM To: user

RE: Writing DataFrame filter results to separate files

2016-12-05 Thread Mendelson, Assaf
If you write to parquet you can use the partitionBy option which would write under a directory for each value of the column (assuming you have a column with the month). From: Michael Armbrust [mailto:mich...@databricks.com] Sent: Tuesday, December 06, 2016 3:33 AM To: Everett Anderson Cc: user
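
For example (assuming the DataFrame has a month column; the output path is a placeholder):

    df.write.partitionBy("month").parquet("/path/to/output")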

doing streaming efficiently

2016-12-06 Thread Mendelson, Assaf
Hi, I have a system which does streaming doing analysis over a long period of time. For example a sliding window of 24 hours every 15 minutes. I have a batch process I need to convert to this streaming. I am wondering how to do so efficiently. I am currently building the streaming process so I

RE: top-k function for Window

2017-01-03 Thread Mendelson, Assaf
You can write a UDAF in which the buffer contains the top K and manage it. This means you don’t need to sort at all. Furthermore, in your example you don’t even need a window function, you can simply use groupby and explode. Of course, this is only relevant if k is small… From: Andy Dang
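
As an illustration of the groupBy-plus-explode route mentioned above (column names and k are assumptions; unlike the UDAF approach, this collects all values per group rather than bounding the buffer to K):

    import org.apache.spark.sql.functions._

    // take the k largest values collected per key and explode them back into rows
    val topK = udf((xs: Seq[Double], k: Int) => xs.sorted(Ordering[Double].reverse).take(k))
    val result = df
      .groupBy("key")
      .agg(collect_list("value").as("values"))
      .select(col("key"), explode(topK(col("values"), lit(3))).as("top_value"))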

RE: top-k function for Window

2017-01-03 Thread Mendelson, Assaf
03, 2017 8:03 PM To: Mendelson, Assaf Cc: user Subject: Re: top-k function for Window > Furthermore, in your example you don’t even need a window function, you can > simply use groupby and explode Can you clarify? You need to sort somehow (be it map-side sorting or reduce-side s

RE: What is missing here to use sql in spark?

2017-01-02 Thread Mendelson, Assaf
sqlContext.sql("select distinct CARRIER from flight201601") defines a dataframe which is lazily evaluated. This means that it returns a dataframe (which is what you got). If you want to see the results do: sqlContext.sql("select distinct CARRIER from flight201601").show() or df =

RE: streaming performance

2016-12-25 Thread Mendelson, Assaf
for outer join etc.). Assaf. From: Tathagata Das [mailto:tathagata.das1...@gmail.com] Sent: Friday, December 23, 2016 2:46 AM To: Mendelson, Assaf Cc: user Subject: Re: streaming performance From what I understand looking at the code in stackoverflow, I think you are "simulating" the

RE: Location for the additional jar files in Spark

2016-12-27 Thread Mendelson, Assaf
You should probably add --driver-class-path with the jar as well. In theory --jars should add it to the driver as well but in my experience it does not (I think there was a jira open on it). In any case you can find it in stackoverflow: See

streaming performance

2016-12-21 Thread Mendelson, Assaf
am having trouble with streaming performance. My main problem is how to do a sliding window calculation where the ratio between the window size and the step size is relatively large (hundreds) without recalculating everything all the time. I created a simple example of what I am aiming at with

RE: How to use ManualClock with Spark streaming

2017-04-05 Thread Mendelson, Assaf
You can try taking a look at this: http://mkuthan.github.io/blog/2015/03/01/spark-unit-testing/ Thanks, Assaf. From: Hemalatha A [mailto:hemalatha.amru...@googlemail.com] Sent: Wednesday, April 05, 2017 1:59 PM To: Saisai Shao; user@spark.apache.org Subject: Re: How to use

RE: Timeline for stable release for Spark Structured Streaming

2017-07-10 Thread Mendelson, Assaf
Any day now. One of the major milestones of spark 2.2 is making structured streaming a stable feature. Spark 2.2 has passed RC6 a couple of days ago so it should be out any day now. Note that people have been using spark 2.1 for production in structured streaming so you should be able to start

RE: underlying checkpoint

2017-07-16 Thread Mendelson, Assaf
Actually, show is an action. The issue is that unless you have some aggregations, show will only go over some of the dataframe, not all of it and therefore the caching won’t occur (similar to what happens with cache). You need an action which requires going over the entire dataframe (which count
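
For example (a minimal illustration, not taken from the thread):

    df.cache()
    df.count()   // touches every row, so the DataFrame is fully materialized and cached
    df.show()    // subsequent actions can now be served from the cache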

RE: mapPartitioningWithIndex in Dataframe

2017-08-05 Thread Mendelson, Assaf
First, I believe you mean the Dataset API rather than the DataFrame API. You can easily add the partition index as a new column to your dataframe using spark_partition_id(). Then a normal mapPartitions should work fine (i.e. you should create the appropriate case class which includes the
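
A rough sketch of the approach described, assuming a single Long column named value and a hypothetical case class:

    import org.apache.spark.sql.functions.spark_partition_id
    import spark.implicits._   // assumes `spark` is the active SparkSession

    // case class matching the original columns plus the added partition index
    case class WithIndex(value: Long, partIdx: Int)

    val ds = df.withColumn("partIdx", spark_partition_id()).as[WithIndex]
    val processed = ds.mapPartitions(iter => iter.map(r => r.copy(value = r.value * 2)))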

[WARN] org.apache.hadoop.util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable

2017-05-10 Thread Mendelson, Assaf
Hi all, When running spark I get the following warning: [WARN] org.apache.hadoop.util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable Now I know that in general it is possible to ignore this warning, however, it means that

incremental broadcast join

2017-05-10 Thread Mendelson, Assaf
Hi, It seems as if when doing broadcast join, the entire dataframe is resent even if part of it has already been broadcasted. Consider the following case: val df1 = ??? val df2 = ??? val df3 = ??? df3.join(broadcast(df1), on=cond, "left_outer") followed by df4.join(broadcast(df1.union(df2),

RE: [WARN] org.apache.hadoop.util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable

2017-05-17 Thread Mendelson, Assaf
Thanks for the response. I will try with log4j. That said, I am running in windows using winutil.exe and still getting the warning. Thanks, Assaf. From: Steve Loughran [mailto:ste...@hortonworks.com] Sent: Tuesday, May 16, 2017 6:55 PM To: Mendelson, Assaf Cc: user

RE: Merging multiple Pandas dataframes

2017-06-21 Thread Mendelson, Assaf
, Assaf. From: Saatvik Shah [mailto:saatvikshah1...@gmail.com] Sent: Tuesday, June 20, 2017 8:50 PM To: Mendelson, Assaf Cc: user@spark.apache.org Subject: Re: Merging multiple Pandas dataframes Hi Assaf, Thanks for the suggestion on checkpointing - I'll need to read up more on that. My

RE: Merging multiple Pandas dataframes

2017-06-20 Thread Mendelson, Assaf
Note that depending on the number of iterations, the query plan for the dataframe can become long and this can cause slowdowns (or even crashes). A possible solution would be to checkpoint (or simply save and reload the dataframe) every once in a while. When reloading from disk, the newly loaded
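
A sketch of the periodic-checkpoint idea, assuming a sequence of DataFrames named dataframes and an arbitrary interval of 10 iterations:

    // checkpoint every few unions so the accumulated query plan stays short
    spark.sparkContext.setCheckpointDir("/tmp/spark-checkpoints")
    var merged = dataframes.head
    for ((next, i) <- dataframes.tail.zipWithIndex) {
      merged = merged.union(next)
      if ((i + 1) % 10 == 0) merged = merged.checkpoint()
    }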

strange warning

2017-05-25 Thread Mendelson, Assaf
Hi all, Today, I got the following warning: [WARN] org.apache.parquet.hadoop.ParquetRecordReader: Can not initialize counter due to context is not a instance of TaskInputOutputContext, but is org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl This occurs on one of my tests but not on

[WARN] org.apache.parquet.hadoop.ParquetRecordReader: Can not initialize counter due to context is not a instance of TaskInputOutputContext, but is org.apache.hadoop.mapreduce.task.TaskAttemptContextI

2017-05-28 Thread Mendelson, Assaf
Hi, I am getting the following warning: [WARN] org.apache.parquet.hadoop.ParquetRecordReader: Can not initialize counter due to context is not a instance of TaskInputOutputContext, but is org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl This seems to occur every time I try to read from

RE: strange warning

2017-05-25 Thread Mendelson, Assaf
ad.parquet(filename) df2.show() Any ideas? Thanks, Assaf. From: Mendelson, Assaf [mailto:assaf.mendel...@rsa.com] Sent: Thursday, May 25, 2017 9:55 AM To: user@spark.apache.org Subject: strange warning Hi all, Today, I got the following warning: [WARN] org.apache.parquet.hadoop.ParquetRe

having trouble using structured streaming with file sink (parquet)

2017-06-13 Thread Mendelson, Assaf
Hi all, I have recently started assessing structured streaming and ran into a little snag from the beginning. Basically I wanted to read some data, do some basic aggregation and write the result to file: import org.apache.spark.sql.functions.avg import

RE: [How-To] Custom file format as source

2017-06-12 Thread Mendelson, Assaf
Try https://mapr.com/blog/spark-data-source-api-extending-our-spark-sql-query-engine/ Thanks, Assaf. -Original Message- From: OBones [mailto:obo...@free.fr] Sent: Monday, June 12, 2017 1:01 PM To: user@spark.apache.org Subject: [How-To] Custom file format as source

RE: Spark Web UI SSL Encryption

2017-08-21 Thread Mendelson, Assaf
The following is based on stuff I did a while ago so I might be missing some parts. First you need to create a certificate. The following example creates a self-signed one: openssl genrsa -aes128 -out sparkssl.key 2048 -alias "standalone" openssl rsa -in sparkssl.key -pubout -out

RE: DataSourceV2 APIs creating multiple instances of DataSourceReader and hence not preserving the state

2018-10-09 Thread Mendelson, Assaf
Could you add a fuller code example? I tried to reproduce it in my environment and I am getting just one instance of the reader… Thanks, Assaf From: Shubham Chaurasia [mailto:shubh.chaura...@gmail.com] Sent: Tuesday, October 9, 2018 9:31 AM To: user@spark.apache.org Subject:

RE: DataSourceV2 APIs creating multiple instances of DataSourceReader and hence not preserving the state

2018-10-09 Thread Mendelson, Assaf
[mailto:shubh.chaura...@gmail.com] Sent: Tuesday, October 9, 2018 2:02 PM To: Mendelson, Assaf; user@spark.apache.org Subject: Re: DataSourceV2 APIs creating multiple instances of DataSourceReader and hence not preserving the state [EXTERNAL EMAIL] Please report any suspicious attachments, links