exhaustive list of configuration options

2018-11-19 Thread Shiyuan
Hi Spark Users, Is there a way I can get the exhaustive list of configuration options and their default values? The documentation page https://spark.apache.org/docs/latest/configuration.html is not exhaustive. The Spark UI/environment tab is not exhaustive either. Thank you!
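
Not part of the thread, but for reference: a minimal PySpark sketch of two common ways to inspect configuration from a live session. Neither is truly exhaustive — options that were never set or registered will not appear.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("list-confs").getOrCreate()

# 1) Entries explicitly set on the SparkConf of this application
for key, value in sorted(spark.sparkContext.getConf().getAll()):
    print(key, "=", value)

# 2) Spark SQL configuration keys with current values, defaults and descriptions
spark.sql("SET -v").show(truncate=False)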

PySpark Streaming and Secured Kafka.

2018-11-19 Thread Ramaswamy, Muthuraman
Hi All, I would like to use PySpark Streaming with secured Kafka as the source stream. What options or arguments should I pass in the spark-submit command? A sample spark-submit command with all the required options/arguments to access a remote, secured Kafka cluster would help. Thank you, ~Muthu R
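
No sample command appears in the thread; below is only a sketch, assuming Structured Streaming's Kafka source with SASL_SSL/PLAIN security. Broker address, topic, package version and credentials are placeholders to adapt to your cluster.

# Submit with the Kafka source package on the classpath, for example:
#   spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.0 app.py
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("secured-kafka").getOrCreate()

df = (spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker1:9093")   # placeholder
      .option("subscribe", "my_topic")                      # placeholder
      # Kafka consumer security settings are passed through with the "kafka." prefix
      .option("kafka.security.protocol", "SASL_SSL")
      .option("kafka.sasl.mechanism", "PLAIN")
      .option("kafka.sasl.jaas.config",
              'org.apache.kafka.common.security.plain.PlainLoginModule '
              'required username="user" password="secret";')
      .load())

query = df.selectExpr("CAST(value AS STRING)").writeStream.format("console").start()
query.awaitTermination()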

Re: [Spark Structured Streaming]: Read Kafka offset from a timestamp

2018-11-19 Thread Jungtaek Lim
It really depends on whether we use it only for starting the query (instead of restoring from a checkpoint) or whether we would want to restore a previous batch from a specific time (restoring state as well). The former would make sense and I'll try to see whether I can address it. The latter doesn't
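
For the "starting query" case, a commonly used workaround (a sketch, not something from this thread; it assumes the kafka-python package) is to resolve the offsets for the desired timestamp on the client and pass them to Spark as explicit startingOffsets:

import json
from kafka import KafkaConsumer, TopicPartition
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("offsets-by-time").getOrCreate()

topic = "my_topic"                # placeholder
brokers = "broker1:9092"          # placeholder
timestamp_ms = 1542585600000      # start reading from this point in time

# Ask Kafka which offset corresponds to the timestamp in each partition
consumer = KafkaConsumer(bootstrap_servers=brokers)
lookup = {TopicPartition(topic, p): timestamp_ms
          for p in consumer.partitions_for_topic(topic)}
offsets = consumer.offsets_for_times(lookup)

# -1 (= latest) is used as a fallback when no message is newer than the timestamp
starting = {topic: {str(tp.partition): (o.offset if o else -1)
                    for tp, o in offsets.items()}}

df = (spark.readStream.format("kafka")
      .option("kafka.bootstrap.servers", brokers)
      .option("subscribe", topic)
      .option("startingOffsets", json.dumps(starting))
      .load())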

Re: [Spark SQL] [Spark 2.4.0] v1 -> struct(v1.e) fails

2018-11-19 Thread kathleen li
How about this: df.select(expr("transform( b, v1 -> struct(v1) )")).show() + |transform(b, lambdafunction(named_struct(v1, namedlambdavariable()), namedlambdavariable()))|
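
A self-contained repro of that suggestion (the schema below is made up: an array column b of structs with a field e, matching the failing expression in the subject):

from pyspark.sql import SparkSession, Row
from pyspark.sql.functions import expr

spark = SparkSession.builder.appName("transform-struct").getOrCreate()

df = spark.createDataFrame(
    [(1, [Row(e=10), Row(e=20)])],
    "v1 INT, b ARRAY<STRUCT<e: INT>>")

# Workaround from the reply: wrap the whole lambda variable in struct()
df.select(expr("transform(b, v1 -> struct(v1))")).show(truncate=False)

# The expression from the subject that reportedly fails on 2.4.0:
# df.select(expr("transform(b, v1 -> struct(v1.e))")).show()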

Re: streaming pdf

2018-11-19 Thread Jörn Franke
Well, I am not so sure about the use cases, but what about using StreamingContext.fileStream?

Re: streaming pdf

2018-11-19 Thread Jörn Franke
And you have to write your own input format, but this is not so complicated (probably recommended anyway for the PDF case) > On 20.11.2018 at 08:06, Jörn Franke wrote: > > Well, I am not so sure about the use cases, but what about using > StreamingContext.fileStream? >
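
The fileStream/InputFormat route above is a Scala-side API; for context, a rough PySpark batch counterpart (only a sketch — path and parser are placeholders) reads whole PDFs as binary blobs:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("pdf-batch").getOrCreate()
sc = spark.sparkContext

def parse_pdf(path, raw_bytes):
    # Placeholder: plug in a real PDF library here
    return (path, len(raw_bytes))

# binaryFiles yields (path, content) pairs, one per file
pdfs = sc.binaryFiles("hdfs:///data/incoming/*.pdf")   # placeholder path
parsed = pdfs.map(lambda kv: parse_pdf(kv[0], kv[1]))
print(parsed.take(5))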

Spark DataSets and multiple write(.) calls

2018-11-19 Thread Dipl.-Inf. Rico Bergmann
Hi! I have a SparkSQL program with one input and 6 outputs (write). When executing this program, every call to write(.) executes the plan. My problem is that I want all these writes to happen in parallel (inside one execution plan), because all writes have a common and compute-intensive
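
To illustrate the situation being described (names and paths are made up): each write() triggers the full plan again unless the shared intermediate result is persisted or checkpointed, which is what the replies below discuss.

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("multi-write").getOrCreate()

raw = spark.read.parquet("hdfs:///data/input")                    # placeholder
expensive = raw.filter(col("value") > 0).groupBy("key").count()   # shared, costly part

# Six independent outputs; without caching, "expensive" is recomputed for every write
for i in range(6):
    (expensive.filter(col("count") > i)
     .write.mode("overwrite")
     .parquet("hdfs:///data/output_{}".format(i)))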

Re: streaming pdf

2018-11-19 Thread Nicolas Paris
On Mon, Nov 19, 2018 at 07:23:10AM +0100, Jörn Franke wrote: > Why does it have to be a stream? > Right now I manage the pipelines as Spark batch processing. Moving to streaming would add some improvements such as: - simplification of the pipeline - more frequent data ingestion - better resource

Re: Spark DataSets and multiple write(.) calls

2018-11-19 Thread Magnus Nilsson
I had the same requirements. As far as I know the only way is to extend the ForeachWriter, cache the micro-batch result and write to each output. https://docs.databricks.com/spark/latest/structured-streaming/foreach.html Unfortunately it seems as if
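
A related sketch (not from the thread) using foreachBatch, available from Spark 2.4, which caches each micro-batch once and fans it out to several sinks; the source, sink formats and paths below are placeholders:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("multi-sink").getOrCreate()

stream = spark.readStream.format("rate").load()   # stand-in source

def write_to_sinks(batch_df, batch_id):
    # Cache the micro-batch so it is computed only once for both writes
    batch_df.persist()
    batch_df.write.mode("append").parquet("hdfs:///sinks/parquet_out")
    batch_df.write.mode("append").json("hdfs:///sinks/json_out")
    batch_df.unpersist()

query = (stream.writeStream
         .foreachBatch(write_to_sinks)
         .option("checkpointLocation", "hdfs:///chk/multi-sink")
         .start())
query.awaitTermination()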

Pre-built package for Apache Spark 2.4 broken

2018-11-19 Thread b-moisson
Hello, It looks like the dependencies for the pre-built package for Spark 2.4 are wrong. When I download it from https://spark.apache.org/downloads.html I get spark-2.4, but when I launch a spark-shell and check the Spark version, I get: Welcome to __ / __/__ ___ _/

Regression of external shuffle service spark 2.3 vs spark 2.2

2018-11-19 Thread igor.berman
Hi, any input regarding the below is welcome. We are running with the external shuffle service on a Mesos cluster (1.5.1). After upgrading our production workload to Spark 2.3 we started to see OOM failures of the external shuffle services (running on each node). Has anybody experienced the same problems? Any

Re: Spark DataSets and multiple write(.) calls

2018-11-19 Thread Dipl.-Inf. Rico Bergmann
Thanks for your advice. But I'm using batch processing. Does anyone have a solution for the batch processing case? Best, Rico. On 19.11.2018 at 09:43, Magnus Nilsson wrote: > I had the same

Re: Spark DataSets and multiple write(.) calls

2018-11-19 Thread Vadim Semenov
You can use checkpointing. In this case Spark will write out an RDD to whatever destination you specify, and then the RDD can be reused from the checkpointed state, avoiding recomputation. On Mon, Nov 19, 2018 at 7:51 AM Dipl.-Inf. Rico Bergmann <i...@ricobergmann.de> wrote: > Thanks for your
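
A minimal sketch of that suggestion for the batch case (paths and columns are placeholders): set a checkpoint directory, checkpoint the shared DataFrame once, then run every write against the checkpointed result.

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("checkpointed-writes").getOrCreate()
spark.sparkContext.setCheckpointDir("hdfs:///tmp/checkpoints")

shared = (spark.read.parquet("hdfs:///data/input")
          .filter(col("value") > 0)
          .groupBy("key").count()
          .checkpoint())   # materialized once, reused by every write below

shared.write.mode("overwrite").parquet("hdfs:///data/out_a")
shared.write.mode("overwrite").json("hdfs:///data/out_b")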