Spark Sparser library

2018-08-09 Thread umargeek
Hi Team, Please let me know which Spark Sparser library I should include when submitting the Spark application so that the format below can be used: val df = spark.read.format("*edu.stanford.sparser.json*"). When I used the above format it threw a ClassNotFoundException. Thanks, Umar
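
A custom data source format like this only resolves if the jar that provides the format class is on the driver and executor classpath. A minimal sketch, assuming a hypothetical sparser jar name and input path (neither is given in the message):

    // Ship the jar that contains the format class with the application,
    // e.g. (jar name and paths are assumptions):
    //   spark-submit --jars /path/to/sparser-spark.jar --class com.example.App app.jar
    // or publish it and reference it with --packages groupId:artifactId:version.

    val df = spark.read
      .format("edu.stanford.sparser.json")   // resolves only once the jar above is on the classpath
      .load("/path/to/input.json")           // hypothetical input path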

MultilayerPerceptronClassifier

2018-08-09 Thread Mina Aslani
Hi, I have a couple of questions about using MultilayerPerceptronClassifier in pyspark. - Do we have any options for the solver other than solver='l-bfgs'? - Also, I tried to tune it with cross validation and find the best model based on the defined parameters (e.g. maxIter, layers, tol, seed).
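
If I remember correctly, the only solver values the classifier accepts are "l-bfgs" (the default) and "gd". Below is a minimal Scala sketch of cross-validating over maxIter and tol; the pyspark tuning classes mirror these, and the layer sizes and data are made up:

    import org.apache.spark.ml.classification.MultilayerPerceptronClassifier
    import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
    import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

    // Hypothetical layer sizes: 4 input features, one hidden layer of 8, 3 classes.
    val mlp = new MultilayerPerceptronClassifier()
      .setLayers(Array(4, 8, 3))
      .setSolver("l-bfgs")   // "gd" is the only other solver I'm aware of
      .setSeed(1234L)

    val grid = new ParamGridBuilder()
      .addGrid(mlp.maxIter, Array(50, 100, 200))
      .addGrid(mlp.tol, Array(1e-4, 1e-6))
      .build()

    val cv = new CrossValidator()
      .setEstimator(mlp)
      .setEvaluator(new MulticlassClassificationEvaluator().setMetricName("accuracy"))
      .setEstimatorParamMaps(grid)
      .setNumFolds(3)

    // val model = cv.fit(trainingDf)   // trainingDf: a features/label DataFrame (not shown)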

[Structured Streaming] Two watermarks and StreamingQueryListener

2018-08-09 Thread subramgr
Hi, We have two *flatMapGroupsWithState* operators in our job and two *withWatermark* calls. We are getting the max event time, event time, and watermark from *QueryProgressEvent*, but right now it returns just one *watermark* value. Does Spark maintain two watermarks or just one? If one, which one
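
For what it's worth, my understanding is that Spark keeps a single global watermark per query, computed as the minimum of the per-input watermarks by default (Spark 2.4 is adding a spark.sql.streaming.multipleWatermarkPolicy option to choose the maximum instead). A minimal sketch with two watermarked inputs, using rate sources as stand-ins for the real ones:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("TwoWatermarks").getOrCreate()

    // Two independent streaming inputs, each with its own watermark.
    val left = spark.readStream.format("rate").load()
      .withWatermark("timestamp", "10 minutes")

    val right = spark.readStream.format("rate").load()
      .withWatermark("timestamp", "20 minutes")

    // Downstream of the union, QueryProgressEvent reports one global
    // watermark (the minimum of the two by default).
    val unioned = left.union(right)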

How does mapPartitions function work in Spark streaming on DStreams?

2018-08-09 Thread zakhavan
Hello, I am running a Spark Streaming program on a dataset that is a sequence of numbers in text file format. I read the text file, load it into a Kafka topic, run the Spark Streaming program on the DStream, and finally write the result to an output text file. But I'm getting
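
For reference, a minimal sketch of mapPartitions on a DStream; it uses a socket source instead of Kafka to stay self-contained, and the parsing logic is made up. The function is invoked once per partition of each micro-batch RDD, so per-partition setup happens once rather than per record:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf().setAppName("MapPartitionsOnDStream")
    val ssc  = new StreamingContext(conf, Seconds(5))

    // Stand-in source; the original job reads the numbers from a Kafka topic.
    val lines = ssc.socketTextStream("localhost", 9999)

    // mapPartitions is called once per partition of every micro-batch RDD.
    val parsed = lines.mapPartitions { iter =>
      iter.flatMap(line => scala.util.Try(line.trim.toDouble).toOption)
    }

    parsed.print()
    ssc.start()
    ssc.awaitTermination()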

Kryoserializer with pyspark

2018-08-09 Thread Hichame El Khalfi
Hello there! Is there any benefit to tuning `spark.kryoserializer.buffer` and `spark.kryoserializer.buffer.max` if we just use pyspark with no Java or Scala classes? Thanks for your help, Hichame

Re: Implementing .zip file codec

2018-08-09 Thread mytramesh
Spark doesn't support reading zip files directly, since the format is not splittable. Read them with the java.util.ZipInputStream API and build an RDD from the entries (4 GB limit). import java.util.zip.ZipInputStream import scala.io.Source import org.apache.spark.input.PortableDataStream var zipPath = "s3://
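
A sketch of the rest of that pattern, assuming a SparkContext named sc and a placeholder S3 path (the original message truncates the real one):

    import java.util.zip.ZipInputStream
    import scala.io.Source
    import org.apache.spark.input.PortableDataStream

    val zipPath = "s3://bucket/path/*.zip"   // placeholder path

    // binaryFiles yields one (fileName, PortableDataStream) pair per zip file;
    // each archive is read by a single task, so it cannot be split across executors.
    val lines = sc.binaryFiles(zipPath).flatMap { case (_, pds: PortableDataStream) =>
      val zis = new ZipInputStream(pds.open())
      Iterator.continually(zis.getNextEntry)
        .takeWhile(_ != null)
        .flatMap(_ => Source.fromInputStream(zis).getLines().toList)
    }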

Re: groupBy and then coalesce impacts shuffle partitions in unintended way

2018-08-09 Thread Koert Kuipers
Thanks for that long reply, Jungtaek! So when I set spark.sql.shuffle.partitions to 2048 I have 2048 data partitions (or "partitions of state"). These are determined by a hashing function. OK, got it! When I look at the application manager I also see 2048 "tasks" for the relevant stage. So tasks

Re: Structured Streaming doesn't write checkpoint log when I use coalesce

2018-08-09 Thread Jungtaek Lim
Which version do you use? The above app works with Spark 2.3.1; 200 partitions are stored for state. val queryStatusFile = conf.queryStatusFile() val rateRowPerSecond = conf.rateRowPerSecond() val rateRampUpTimeSecond = conf.rateRampUpTimeSecond() val ss = SparkSession

Structured Streaming doesn't write checkpoint log when I use coalesce

2018-08-09 Thread WangXiaolong
Hi, Lately I encountered a problem when writing a Structured Streaming job that writes into OpenTSDB. The write-stream part looks something like outputDs .coalesce(14) .writeStream .outputMode("append")
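
For comparison, a minimal completed version of that pattern; the sink, paths, and trigger here are placeholders rather than the OpenTSDB writer from the original job. The checkpointLocation option is what tells Spark where to write the offset/commit logs:

    import org.apache.spark.sql.streaming.Trigger

    val query = outputDs                                      // outputDs as in the message above
      .coalesce(14)
      .writeStream
      .outputMode("append")
      .format("parquet")                                      // placeholder sink
      .option("path", "/tmp/streaming-out")                   // placeholder output path
      .option("checkpointLocation", "/tmp/streaming-ckpt")    // where the checkpoint log is written
      .trigger(Trigger.ProcessingTime("30 seconds"))
      .start()

    query.awaitTermination()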

Understanding spark.executor.memoryOverhead

2018-08-09 Thread Akash Mishra
Hi *, I would like to know which part of the Spark codebase uses spark.executor.memoryOverhead. I have a job with *spark.executor.memory=18g*, but it requires spark.executor.memoryOverhead=4g for the same process; otherwise I get a task error, ExecutorLostFailure (executor 28 exited caused by one of
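
As far as I know, the overhead setting is read by the cluster-manager integration (e.g. the YARN allocator) when it sizes the executor container, so the container request is roughly the sum of the two values; the overhead covers off-heap usage such as JVM internals, thread stacks, and network buffers. A sketch with the values from the message:

    import org.apache.spark.SparkConf

    // Container request on YARN ≈ spark.executor.memory + spark.executor.memoryOverhead.
    val conf = new SparkConf()
      .set("spark.executor.memory", "18g")
      .set("spark.executor.memoryOverhead", "4g")   // off-heap: JVM overhead, buffers, native libs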

Re: groupBy and then coalesce impacts shuffle partitions in unintended way

2018-08-09 Thread Jungtaek Lim
I could be wrong, so anyone is welcome to correct me if I'm missing something here. You can think of Spark operators in a narrow dependency as applying wrapped functions to an iterator (like "op3(op2(op1(iter)))"), and under that model there is no way to adjust the number of partitions. Each partition is
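
A toy illustration of that composition: within one task, narrow transformations are just functions applied in sequence to the partition's iterator, so there is no point at which the data could be redistributed. The op names are made up to mirror the notation above:

    def op1(iter: Iterator[Int]): Iterator[Int]    = iter.map(_ + 1)          // e.g. a map
    def op2(iter: Iterator[Int]): Iterator[Int]    = iter.filter(_ % 2 == 0)  // e.g. a filter
    def op3(iter: Iterator[Int]): Iterator[String] = iter.map(_.toString)     // e.g. another map

    // One task processes one partition end to end:
    val outputOfOnePartition = op3(op2(op1(Iterator(1, 2, 3, 4, 5))))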

Re: groupBy and then coalesce impacts shuffle partitions in unintended way

2018-08-09 Thread Koert Kuipers
Well, an interesting side effect of this is that I can now control the number of partitions for every shuffle in a DataFrame job, as opposed to having a single setting for the number of partitions across all shuffles. Basically I can set spark.sql.shuffle.partitions to some huge number, and then for
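
A sketch of the pattern being described, with made-up column names: keep spark.sql.shuffle.partitions very high globally, then coalesce right after an aggregation to cap the parallelism used for that particular shuffle:

    import org.apache.spark.sql.functions.sum

    spark.conf.set("spark.sql.shuffle.partitions", "2048")   // global upper bound

    // df is a DataFrame with (key, value) columns (made up for the example);
    // the coalesce caps the number of reduce tasks used for this shuffle.
    val perKey = df
      .groupBy("key")
      .agg(sum("value").as("total"))
      .coalesce(64)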

Error in java_gateway.py

2018-08-09 Thread ClockSlave
Following the code snippets on this thread, I got a working version of XGBoost on pyspark. But one issue I am still facing is the following: File