Re: How to parallelize zip file processing?

2018-08-10 Thread Jörn Franke
Does the zip file contain only one file? I fear in that case you can only use one core. By the way, do you mean gzip? In that case you cannot decompress it in parallel... How is the zip file created? Can't you create several ones? > On 10. Aug 2018, at 22:54, mytramesh wrote: > > I know,
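
A minimal PySpark sketch of the advice above (paths and the UTF-8 text assumption are illustrative, not from the thread): binaryFiles delivers each archive as one whole record, so a single 4GB zip necessarily lands on one task, whereas several smaller zips restore parallelism.

```python
import io
import zipfile

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("zip-parallel").getOrCreate()
sc = spark.sparkContext

def read_zip_entries(path_and_bytes):
    """Yield text lines from every entry of one zip archive."""
    _, content = path_and_bytes
    with zipfile.ZipFile(io.BytesIO(content)) as zf:
        for name in zf.namelist():
            with zf.open(name) as entry:
                for line in io.TextIOWrapper(entry, encoding="utf-8"):
                    yield line.rstrip("\n")

# One record per archive: parallelism is bounded by the number of zips.
lines = sc.binaryFiles("/data/input/*.zip").flatMap(read_zip_entries)
print(lines.take(5))
```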

Re: [Structured Streaming] Two watermarks and StreamingQueryListener

2018-08-10 Thread Tathagata Das
Structured Streaming internally maintains one global watermark by taking the min of the two watermarks. That's why only one gets reported. In Spark 2.4, there will be the option of choosing max instead of min. Just curious: why do you have two watermarks? What's the query like? TD On Thu, Aug 9, 2018
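
A hedged sketch of the behaviour TD describes, with toy rate-source streams standing in for the real query (stream shapes and column names are assumed, not from the thread): each input gets its own withWatermark, Spark folds them into one global watermark with min, and the Spark 2.4 multipleWatermarkPolicy setting switches that to max.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("two-watermarks").getOrCreate()

# Two toy streams, each with its own watermark on its event-time column.
left = (spark.readStream.format("rate").load()
        .selectExpr("timestamp AS leftTime", "value AS key")
        .withWatermark("leftTime", "10 minutes"))
right = (spark.readStream.format("rate").load()
         .selectExpr("timestamp AS rightTime", "value AS key")
         .withWatermark("rightTime", "1 hour"))

# Internally both collapse into one global watermark, min(left, right),
# which is why the listener reports a single value.
joined = left.join(right, "key")

# Spark 2.4+ can take the max of the per-stream watermarks instead:
spark.conf.set("spark.sql.streaming.multipleWatermarkPolicy", "max")
```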

How to parallelize zip file processing?

2018-08-10 Thread mytramesh
I know spark doesn't support zip files directly, since the format is not splittable. Any techniques to process this file quickly? I am trying to process a roughly 4GB zip file. All the data moves to one executor, and only one task gets assigned to process all of it. Even when I run repartition

unsubscribe

2018-08-10 Thread Ryan Adams
Ryan Adams radams...@gmail.com

How to get MultilayerPerceptronClassifier model parameters?

2018-08-10 Thread Mina Aslani
Hi, How can I get the parameters of my MultilayerPerceptronClassifier model? I can only get the layers parameter, using myModel.layers. For the other parameters, when I use myModel.getSeed()/myModel.getTol()/myModel.getMaxIter() I get the below error: 'MultilayerPerceptronClassificationModel' object has
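
A minimal workaround sketch, assuming Spark 2.x PySpark and toy data invented here: the fitted MultilayerPerceptronClassificationModel only exposes layers (and weights), while seed, tol and maxIter are Params of the estimator, so they can be read back from the MultilayerPerceptronClassifier instance that produced the model.

```python
from pyspark.sql import SparkSession
from pyspark.ml.classification import MultilayerPerceptronClassifier
from pyspark.ml.linalg import Vectors

spark = SparkSession.builder.appName("mlp-params").getOrCreate()

# Toy dataset: 2 features, 2 classes; layers = [inputs, hidden, outputs].
train = spark.createDataFrame(
    [(0.0, Vectors.dense(0.0, 0.0)), (1.0, Vectors.dense(1.0, 1.0))],
    ["label", "features"])

mlp = MultilayerPerceptronClassifier(layers=[2, 3, 2], maxIter=100,
                                     seed=42, tol=1e-6)
model = mlp.fit(train)

print(model.layers)      # available on the fitted model
print(mlp.getMaxIter())  # training params are read from the estimator
print(mlp.getSeed())
print(mlp.getTol())
```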

Why is the max iteration for svd not configurable in mllib?

2018-08-10 Thread Sam Lendle
https://github.com/apache/spark/blob/f5aba657396bd4e2e03dd06491a2d169a99592a7/mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/RowMatrix.scala#L191 maxIter is set to max(300, 3 * # singular values). Is there a particular reason for this? And if not, would it be appropriate to submit
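
For reference, a minimal sketch (toy matrix invented here) showing that the public computeSVD API exposes k, computeU and rCond but not the iteration cap, which the linked line hardcodes as max(300, 3 * k):

```python
from pyspark.sql import SparkSession
from pyspark.mllib.linalg import Vectors
from pyspark.mllib.linalg.distributed import RowMatrix

spark = SparkSession.builder.appName("svd-maxiter").getOrCreate()
sc = spark.sparkContext

rows = sc.parallelize([Vectors.dense(1.0, 2.0), Vectors.dense(3.0, 4.0)])

# k, computeU and rCond are the only knobs; there is no maxIter argument.
svd = RowMatrix(rows).computeSVD(k=1, computeU=True, rCond=1e-9)
print(svd.s)  # singular values
```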

Re: [Structured Streaming] Two watermarks and StreamingQueryListener

2018-08-10 Thread Girish Subramanian
Thanks for the explanation. We are doing something like this: the *first* watermark is to eliminate the late events from *kafka*; the *second* watermark is to eliminate older aggregated metrics across *sessions*. I know I can replace the second one with a *window*, but I was not able to come up

Re: Run/install tensorframes on zeppelin pyspark

2018-08-10 Thread Spico Florin
Hello! Thank you very much for your response. As I understood it, in order to use tensorframes in a Zeppelin pyspark notebook with the spark master set to local: 1. we should run the command pip install tensorframes; 2. we should set PYSPARK_PYTHON in conf/zeppelin-env.sh. I have performed the above steps
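
A small smoke test for that setup, loosely based on the tensorframes README (the print_schema helper comes from there, not from this thread): if the import succeeds inside a pyspark paragraph, pip installed the package into the interpreter that PYSPARK_PYTHON points at.

```python
import tensorflow as tf      # tensorframes builds on tensorflow
import tensorframes as tfs   # ImportError here means the wrong interpreter

df = spark.createDataFrame([(float(x),) for x in range(10)], ["x"])
tfs.print_schema(df)  # analyzes and prints the frame's tensor schema
```

Here `spark` is the session that Zeppelin's pyspark interpreter already provides.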

Re: Spark Sparser library

2018-08-10 Thread Jörn Franke
You need to include the library in your dependencies. Furthermore, the * at the end does not make sense. > On 10. Aug 2018, at 07:48, umargeek wrote: > > Hi Team, > > Please let me know the spark Sparser library to use while submitting the > spark application to use the below mentioned format, >
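
A hedged sketch of the general pattern (the Maven coordinate and format name are placeholders, not the real Sparser artifact): third-party data-source libraries are pulled in with spark-submit --packages, or equivalently via spark.jars.packages, and then addressed by their format name.

```python
from pyspark.sql import SparkSession

# Placeholder coordinate; substitute the real groupId:artifactId:version.
spark = (SparkSession.builder
         .config("spark.jars.packages", "com.example:sparser-spark:0.1.0")
         .getOrCreate())

# Placeholder format name; note load() takes a concrete path, no trailing *.
df = spark.read.format("com.example.sparser.json").load("/data/input.json")
```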

Using Logback.xml with Spark

2018-08-10 Thread adithya kanumalla
Hi All, Can you please let me know if any of you have been successful in using Logback.xml in conjunction with Apache Spark 2.X and have been able to do a spark-submit to YARN? I have tried the below solution and it doesn't work with spark-submit to YARN
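
One commonly suggested setup, sketched here with assumed paths and no guarantee it resolves the YARN case from the thread: ship logback.xml to the containers and point both JVMs at it, while making sure logback-classic (and not Spark's bundled log4j binding) wins on the classpath.

```python
from pyspark.sql import SparkSession

# Assumed local path; spark.yarn.dist.files copies the file into each
# container's working directory, where the bare file name below resolves.
spark = (SparkSession.builder
         .config("spark.yarn.dist.files", "file:///etc/spark/logback.xml")
         .config("spark.driver.extraJavaOptions",
                 "-Dlogback.configurationFile=logback.xml")
         .config("spark.executor.extraJavaOptions",
                 "-Dlogback.configurationFile=logback.xml")
         .getOrCreate())
```

In yarn-client mode the driver JVM is already running by this point, so the driver option has to go on the spark-submit command line instead.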