RE: [Spark 1.5]: Exception in thread "broadcast-hash-join-2" java.lang.OutOfMemoryError: Java heap space -- Work in 1.4, but 1.5 doesn't

2015-11-04 Thread Shuai Zheng
An update: this ONLY happens in Spark 1.5. I tried to run it under Spark 1.4 and 1.4.1 and there was no issue (the program was developed under Spark 1.4 originally, and I just re-tested it; it works). So this proves that there is no issue in the logic or data; it is caused by the new version of

Re: kerberos question

2015-11-04 Thread Chen Song
After a bit more investigation, I found that it could be related to impersonation on a kerberized cluster. Our job is started with the following command: /usr/lib/spark/bin/spark-submit --master yarn-client --principal [principal] --keytab [keytab] --proxy-user [proxied_user] ... In application

Re: kerberos question

2015-11-04 Thread Ted Yu
2015-11-04 10:03:31,905 ERROR [Delegation Token Refresh Thread-0] hdfs.KeyProviderCache (KeyProviderCache.java:createKeyProviderURI(87)) - Could not find uri with key [dfs.encryption.key.provider.uri] to create a keyProvider !! Could it be related to HDFS-7931 ? On Wed, Nov 4, 2015 at 12:30

PairRDD from SQL

2015-11-04 Thread pratik khadloya
Hello, Is it possible to have a pair RDD from the below SQL query? The pair being ((item_id, flight_id), metric1); item_id and flight_id are part of the GROUP BY. SELECT item_id, flight_id, SUM(metric1) AS metric1 FROM mytable GROUP BY item_id, flight_id Thanks, Pratik

Re: apply simplex method to fix linear programming in spark

2015-11-04 Thread Debasish Das
Yeah, for this you can use the Breeze quadratic minimizer; it's integrated with Spark in one of my Spark PRs. You have a quadratic objective with equality constraints, which is the primal, and your proximal operator is positivity, which we already support. I have not given an API for a linear objective but that should be simple to

RE: Rule Engine for Spark

2015-11-04 Thread Cheng, Hao
Or try Streaming SQL, which is a simple layer on top of Spark Streaming. ☺ https://github.com/Intel-bigdata/spark-streamingsql From: Cassa L [mailto:lcas...@gmail.com] Sent: Thursday, November 5, 2015 8:09 AM To: Adrian Tanase Cc: Stefano Baghino; user Subject: Re: Rule Engine for Spark

Re: Memory are not used according to setting

2015-11-04 Thread Shixiong Zhu
You should use `SparkConf.set` rather than `SparkConf.setExecutorEnv`. For driver configurations, you need to set them before starting your application; you can use the `--conf` argument when running `spark-submit`. Best Regards, Shixiong Zhu 2015-11-04 15:55 GMT-08:00 William Li
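A minimal sketch of the distinction above (Spark 1.5, Scala); the "6g" values and app name are illustrative only:

    import org.apache.spark.{SparkConf, SparkContext}

    // Executor memory is an application config: use set(), not setExecutorEnv(),
    // and do it before the SparkContext is created.
    val conf = new SparkConf()
      .setAppName("MyApp")
      .set("spark.executor.memory", "6g")
    val sc = new SparkContext(conf)

    // Driver memory must be fixed before the driver JVM starts, so pass it at submit time:
    //   spark-submit --conf spark.driver.memory=6g ...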

how to run RStudio or RStudio Server on ec2 cluster?

2015-11-04 Thread Andy Davidson
Hi, I just set up a Spark cluster on an AWS EC2 cluster. In the past I have done a lot of work using RStudio on my local machine. bin/sparkR looks interesting, however it looks like you just get an R command-line interpreter. Does anyone have experience using something like RStudio or RStudio

Question about Spark shuffle read size

2015-11-04 Thread Dogtail L
Hi all, When I run WordCount using Spark, I find that when I set "spark.default.parallelism" to different numbers, the Shuffle Write size and Shuffle Read size change as well (I read these figures from the history server's web UI). Is it because the shuffle write size also includes some metadata

How to unpersist a DStream in Spark Streaming

2015-11-04 Thread swetha
Hi, How do I unpersist a DStream in Spark Streaming? I know that we can persist using dStream.persist() or dStream.cache(), but I don't see any method to unpersist. Thanks, Swetha

Memory are not used according to setting

2015-11-04 Thread William Li
Hi All - I have a four-worker-node cluster, each node with 8GB memory. When I submit a job, the driver node takes 1GB memory, and each worker node only allocates one executor, which also takes just 1GB memory. The setting of the job has: sparkConf.setExecutorEnv("spark.driver.memory", "6g")

ExecutorId in JAVA_OPTS

2015-11-04 Thread surbhi.mungre
I was trying to profile some Spark jobs and I want to collect Java Flight Recorder (JFR) files from each executor. I am running my job on a YARN cluster with several nodes, so I cannot manually collect the JFR file for each run. MR provides a way to name JFR files generated by each task with the taskId.

Re: Rule Engine for Spark

2015-11-04 Thread Cassa L
Thanks for the reply. How about Drools? Does it work with Spark? LCassa On Wed, Nov 4, 2015 at 3:02 AM, Adrian Tanase wrote: > Another way to do it is to extract your filters as SQL code and load it in > a transform – which allows you to change the filters at runtime. > >

Protobuff 3.0 for Spark

2015-11-04 Thread Cassa L
Hi, Does Spark support protobuf 3.0? I used protobuf 2.5 with spark-1.4 built for HDP 2.3. Given that protobuf has compatibility issues, I want to know if Spark supports protobuf 3.0. LCassa

Re: Issue of Hive parquet partitioned table schema mismatch

2015-11-04 Thread Cheng Lian
Is there any chance that "spark.sql.hive.convertMetastoreParquet" is turned off? Cheng On 11/4/15 5:15 PM, Rex Xiong wrote: Thanks Cheng Lian. I found in 1.5, if I use spark to create this table with partition discovery, the partition pruning can be performed, but for my old table

Spark reading from S3 getting very slow

2015-11-04 Thread Younes Naguib
Hi all, I'm reading large text files from S3, with sizes between 30GB and 40GB. Every stage runs in 8-9s, except the last 32, which jump to 1-2 min for some reason! Here is my sample code: val myDF = sc.textFile(input_file).map{ x => val p = x.split("\t", -1) new

Re: PairRDD from SQL

2015-11-04 Thread Stéphane Verlet
sqlContext.sql().map(row=> ((row.getString(0), row.getString(1)),row.getInt(2))) On Wed, Nov 4, 2015 at 1:44 PM, pratik khadloya wrote: > Hello, > > Is it possible to have a pair RDD from the below SQL query. > The pair being ((item_id, flight_id), metric1) > > item_id,
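For reference, a self-contained version of this suggestion (Spark 1.5, Scala); the table and column names come from Pratik's query, while the context setup and the use of getLong for the SUM column are assumptions that may need adjusting to the actual schema:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    val sc = new SparkContext(new SparkConf().setAppName("PairRDDFromSQL"))
    val sqlContext = new SQLContext(sc)

    val pairRDD = sqlContext
      .sql("SELECT item_id, flight_id, SUM(metric1) AS metric1 FROM mytable GROUP BY item_id, flight_id")
      .map(row => ((row.getString(0), row.getString(1)), row.getLong(2)))
    // pairRDD: RDD[((String, String), Long)]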

Efficient approach to store an RDD as a file in HDFS and read it back as an RDD?

2015-11-04 Thread swetha
Hi, What is the efficient approach to save an RDD as a file in HDFS and retrieve it back? I was thinking between Avro, Parquet and SequenceFileFormat. We currently use SequenceFileFormat for one of our use cases. Any example on how to store and retrieve an RDD in an Avro and Parquet file

Re: apply simplex method to fix linear programming in spark

2015-11-04 Thread Zhiliang Zhu
Dear Debasish Das, Thanks very much for your kind reply. I am very sorry, but could you clarify a little more about the places, since I could not find them. On Thursday, November 5, 2015 5:50 AM, Debasish Das wrote: Yeah for this you can use breeze

Futures timed out after [120 seconds].

2015-11-04 Thread Kayode Odeyemi
Hi, I'm running Spark standalone in cluster mode (1 master, 2 workers). Everything has failed, including spark-submit, with errors such as "Caused by: java.lang.ClassNotFoundException: com.migration.App$$anonfun$upsert$1". Now I've reverted to submitting jobs through Scala apps. Any ideas

Re: Protobuff 3.0 for Spark

2015-11-04 Thread Lan Jiang
I have used protobuf 3 successfully with Spark on CDH 5.4, even though Hadoop itself comes with protobuf 2.5. I think the steps apply to HDP too. You need to do the following: 1. Set the parameters below: spark.executor.userClassPathFirst=true spark.driver.userClassPathFirst=true 2. Include
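A hedged sketch of how those settings are typically passed at submit time; the jar name, application class, and use of --jars are placeholders and assumptions, since the original message is truncated before step 2:

    spark-submit \
      --conf spark.executor.userClassPathFirst=true \
      --conf spark.driver.userClassPathFirst=true \
      --jars protobuf-java-3.0.0-beta-1.jar \
      --class com.example.MyJob myapp.jar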

Re: how to run RStudio or RStudio Server on ec2 cluster?

2015-11-04 Thread Shivaram Venkataraman
RStudio should already be setup if you launch an EC2 cluster using spark-ec2. See http://blog.godatadriven.com/sparkr-just-got-better.html for details. Shivaram On Wed, Nov 4, 2015 at 5:11 PM, Andy Davidson wrote: > Hi > > I just set up a spark cluster on AWS ec2

Re: Rule Engine for Spark

2015-11-04 Thread Daniel Mahler
I am not familiar with any rule engines on Spark Streaming or even plain Spark. Conceptually, the closest things I am aware of are Datomic and Bloom-lang. Neither of them is Spark-based, but they implement Datalog-like languages over distributed stores. - http://www.datomic.com/ -

Re: Rule Engine for Spark

2015-11-04 Thread Cassa L
ok. Let me try it. Thanks, LCassa On Wed, Nov 4, 2015 at 4:44 PM, Cheng, Hao wrote: > Or try Streaming SQL? Which is a simple layer on top of the Spark > Streaming. J > > > > https://github.com/Intel-bigdata/spark-streamingsql > > > > > > *From:* Cassa L

Re: Allow multiple SparkContexts in Unit Testing

2015-11-04 Thread Priya Ch
Already tried setting spark.driver.allowMultipleContexts to true, but it was not successful. I think the problem is we have different test suites which of course run in parallel. How do we stop the SparkContext after each test suite and start it in the next test suite, or is there any way to share a SparkContext

Re: How to unpersist a DStream in Spark Streaming

2015-11-04 Thread Saisai Shao
Hi Swetha, Would you mind elaborating on your usage scenario for DStream unpersisting? From my understanding: 1. Spark Streaming will automatically unpersist outdated data (you already mentioned the configurations). 2. If the streaming job is started, I think you may lose control of the job,

Re: How to unpersist a DStream in Spark Streaming

2015-11-04 Thread Yashwanth Kumar
Hi, A DStream (Discretized Stream) is made up of multiple RDDs. You can unpersist each RDD by accessing the individual RDDs using dstream.foreachRDD { rdd => rdd.unpersist() }
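A minimal Scala sketch of this suggestion, assuming `dstream` is an already-persisted DStream; whether explicit unpersisting is needed at all depends on the spark.streaming.unpersist setting mentioned elsewhere in this thread:

    dstream.foreachRDD { rdd =>
      // ... use rdd here ...
      rdd.unpersist(blocking = false)  // drop the cached blocks once this batch is done
    }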

Re: How to unpersist a DStream in Spark Streaming

2015-11-04 Thread swetha kasireddy
Other than setting the following: sparkConf.set("spark.streaming.unpersist", "true") sparkConf.set("spark.cleaner.ttl", "7200s") On Wed, Nov 4, 2015 at 5:03 PM, swetha wrote: > Hi, > > How to unpersist a DStream in Spark Streaming? I know that we can persist > using

Re: Efficient approach to store an RDD as a file in HDFS and read it back as an RDD?

2015-11-04 Thread Stefano Baghino
What scenario would you like to optimize for? If you have something more specific regarding your use case, the mailing list can surely provide you with some very good advice. If you just want to save an RDD as Avro, you can use a module from Databricks (the README on GitHub
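For illustration, a hedged sketch of saving an RDD as Parquet (built in) or Avro (via the Databricks spark-avro package) in Spark 1.5 and reading it back, assuming an existing SparkContext `sc` as in spark-shell; the case class, paths, and package version are assumptions:

    import org.apache.spark.sql.SQLContext

    case class Event(id: String, count: Long)
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    val rdd = sc.parallelize(Seq(Event("a", 1L), Event("b", 2L)))

    // Parquet round trip
    rdd.toDF().write.parquet("hdfs:///tmp/events.parquet")
    val fromParquet = sqlContext.read.parquet("hdfs:///tmp/events.parquet")
      .map(r => Event(r.getString(0), r.getLong(1)))

    // Avro round trip (requires e.g. --packages com.databricks:spark-avro_2.10:2.0.1)
    rdd.toDF().write.format("com.databricks.spark.avro").save("hdfs:///tmp/events.avro")
    val fromAvro = sqlContext.read.format("com.databricks.spark.avro").load("hdfs:///tmp/events.avro")
      .map(r => Event(r.getString(0), r.getLong(1)))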

Re: Executor app-20151104202102-0000 finished with state EXITED

2015-11-04 Thread Kayode Odeyemi
I've tried that once. No job was executed on the workers; that is, the workers weren't used. What I want to achieve is to have the SparkContext use a remote Spark standalone master at 192.168.2.11 (this is where I started the master with ./start-master.sh and all the slaves with

Re: Spark 1.5.1 Dynamic Resource Allocation

2015-11-04 Thread tstewart
https://issues.apache.org/jira/browse/SPARK-10790 Changed to add minExecutors < initialExecutors < maxExecutors and that works. spark-shell --conf spark.dynamicAllocation.enabled=true --conf spark.shuffle.service.enabled=true --conf spark.dynamicAllocation.minExecutors=2 --conf
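A hedged example of the working combination (the executor counts are illustrative; the point from SPARK-10790 is simply that minExecutors <= initialExecutors <= maxExecutors):

    spark-shell \
      --conf spark.dynamicAllocation.enabled=true \
      --conf spark.shuffle.service.enabled=true \
      --conf spark.dynamicAllocation.minExecutors=2 \
      --conf spark.dynamicAllocation.initialExecutors=4 \
      --conf spark.dynamicAllocation.maxExecutors=10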

[Spark 1.5]: Exception in thread "broadcast-hash-join-2" java.lang.OutOfMemoryError: Java heap space

2015-11-04 Thread Shuai Zheng
Hi All, I have a program which actually runs some fairly complex business logic (a join) in Spark, and I get the exception below. I am running on Spark 1.5 with parameters: spark-submit --deploy-mode client --executor-cores=24 --driver-memory=2G --executor-memory=45G --class . Some other setup:

Re: Executor app-20151104202102-0000 finished with state EXITED

2015-11-04 Thread Kayode Odeyemi
Thanks Ted. Where would you suggest I add that? I'm creating a SparkContext from a Spark app. My conf setup looks like this: conf.setMaster("spark://192.168.2.11:7077") conf.set("spark.logConf", "true") conf.set("spark.akka.logLifecycleEvents", "true") conf.set("spark.executor.memory", "5g") On

Re: Allow multiple SparkContexts in Unit Testing

2015-11-04 Thread Bryan Jeffrey
Priya, If you're trying to get unit tests running with local Spark contexts, you can just set up your Spark context with 'spark.driver.allowMultipleContexts' set to true. Example: def create(seconds : Int, appName : String): StreamingContext = { val master = "local[*]" val conf = new
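A possible completion of that example; the body after "val conf = new" is reconstructed, not Bryan's original:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    def create(seconds: Int, appName: String): StreamingContext = {
      val master = "local[*]"
      val conf = new SparkConf()
        .setMaster(master)
        .setAppName(appName)
        .set("spark.driver.allowMultipleContexts", "true")
      new StreamingContext(conf, Seconds(seconds))
    }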

Re: Executor app-20151104202102-0000 finished with state EXITED

2015-11-04 Thread Ted Yu
Have you tried using -Dspark.master=local ? Cheers On Wed, Nov 4, 2015 at 10:47 AM, Kayode Odeyemi wrote: > Hi, > > I can't seem to understand why all created executors always fail. > > I have a Spark standalone cluster setup make up of 2 workers and 1 master. > My spark-env

Re: Executor app-20151104202102-0000 finished with state EXITED

2015-11-04 Thread Ted Yu
Something like this: conf.setMaster("local[3]") On Wed, Nov 4, 2015 at 11:08 AM, Kayode Odeyemi wrote: > Thanks Ted. > > Where would you suggest I add that? I'm creating a SparkContext from a > Spark app. My conf setup looks like this: > >

Re: PMML version in MLLib

2015-11-04 Thread Fazlan Nazeem
Thanks Owen. Will do it On Wed, Nov 4, 2015 at 5:22 PM, Sean Owen wrote: > I'm pretty sure that attribute is required. I am not sure what PMML > version the code has been written for but would assume 4.2.1. Feel > free to open a PR to add this version to all the output. > >

SparkSQL JDBC to PostGIS

2015-11-04 Thread Mustafa Elbehery
Hi Folks, I am trying to connect from the Spark shell to a PostGIS database. PostGIS is simply a *spatial* extension for PostgreSQL, in order to support *geometry* types. Although the JDBC connection from Spark works well with PostgreSQL, it does not with a database on the same server which supports

Re: Codegen In Shuffle

2015-11-04 Thread 牛兆捷
I see. Thanks very much. 2015-11-04 16:25 GMT+08:00 Reynold Xin : > GenerateUnsafeProjection -- projects any internal row data structure > directly into bytes (UnsafeRow). > > > On Wed, Nov 4, 2015 at 12:21 AM, 牛兆捷 wrote: > >> Dear all: >> >> Tungsten

Re: Allow multiple SparkContexts in Unit Testing

2015-11-04 Thread Ted Yu
Are you trying to speed up tests where each test suite uses single SparkContext ? You may want to read: https://issues.apache.org/jira/browse/SPARK-2243 Cheers On Wed, Nov 4, 2015 at 4:59 AM, Priya Ch wrote: > Hello All, > > How to use multiple Spark Context in

Re: Why some executors are lazy?

2015-11-04 Thread Adrian Tanase
Your solution is an easy, pragmatic one; however, there are many factors involved and it's not guaranteed to work for other data sets. It depends on: * The distribution of data across the keys. Random session ids will naturally distribute better than "customer names" - you can mitigate this
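As an illustration of mitigating key skew, here is a hedged "salting" sketch for a simple sum aggregation; `pairs` (an RDD[(String, Long)]) and the salt count are assumptions, and the idea is only that hot keys get spread over several partitions before the final reduce:

    import scala.util.Random

    val numSalts = 16
    val salted  = pairs.map { case (k, v) => ((k, Random.nextInt(numSalts)), v) }
    val partial = salted.reduceByKey(_ + _)          // distributes work for the hot keys
    val result  = partial.map { case ((k, _), v) => (k, v) }
                         .reduceByKey(_ + _)         // final per-key totals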

Problem using BlockMatrix.add

2015-11-04 Thread Kareem Sorathia
Hi, I'm attempting to use the distributed matrix data structure BlockMatrix (Spark 1.5.0, scala) and having some issues when attempting to add two block matrices together (error attached below). I'm constructing the two matrices by creating a collection of MatrixEntry's, putting that into

Re: SparkSQL JDBC to PostGIS

2015-11-04 Thread Stefano Baghino
Hi Mustafa, are you trying to run geospatial queries on the PostGIS DB with SparkSQL? Correct me if I'm wrong, but I think SparkSQL itself would need to support the geospatial extensions in order for this to work. On Wed, Nov 4, 2015 at 1:46 PM, Mustafa Elbehery
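For plain (non-geometry) columns, a JDBC read like the following generally works in Spark 1.5; the URL, credentials, and table are placeholders, the PostgreSQL driver must be on the classpath, and geometry columns would still fail because Spark SQL has no type mapping for them:

    val props = new java.util.Properties()
    props.setProperty("user", "myuser")
    props.setProperty("password", "secret")

    val df = sqlContext.read.jdbc(
      "jdbc:postgresql://dbhost:5432/gisdb",  // placeholder connection string
      "public.places",                        // table containing only standard SQL types
      props)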

Allow multiple SparkContexts in Unit Testing

2015-11-04 Thread Priya Ch
Hello All, How do we use multiple SparkContexts when executing multiple test suites of Spark code? Can someone shed light on this?

Re: Spark Streaming data checkpoint performance

2015-11-04 Thread Adrian Tanase
Nice! Thanks for sharing, I wasn’t aware of the new API. Left some comments on the JIRA and design doc. -adrian From: Shixiong Zhu Date: Tuesday, November 3, 2015 at 3:32 AM To: Thúy Hằng Lê Cc: Adrian Tanase, "user@spark.apache.org" Subject: Re: Spark Streaming

Re: PMML version in MLLib

2015-11-04 Thread Sean Owen
I'm pretty sure that attribute is required. I am not sure what PMML version the code has been written for but would assume 4.2.1. Feel free to open a PR to add this version to all the output. On Wed, Nov 4, 2015 at 11:42 AM, Fazlan Nazeem wrote: > [adding dev] > > On Wed, Nov

Re: Why some executors are lazy?

2015-11-04 Thread Adrian Tanase
If some of the operations required involve shuffling and partitioning, it might mean that the data set is skewed to specific partitions which will create hot spotting on certain executors. -adrian From: Khaled Ammar Date: Tuesday, November 3, 2015 at 11:43 PM To:

Distributing Python code packaged as tar balls

2015-11-04 Thread Praveen Chundi
Hi, PySpark/spark-submit offers a --py-files option to distribute Python code for execution. Currently (version 1.5) only zip files seem to be supported; I have tried distributing tarballs unsuccessfully. Is it worth adding support for tarballs? Best regards, Praveen Chundi

Re: PMML version in MLLib

2015-11-04 Thread Fazlan Nazeem
[adding dev] On Wed, Nov 4, 2015 at 2:27 PM, Fazlan Nazeem wrote: > I just went through all specifications, and they expect the version > attribute. This should be addressed very soon because if we cannot use the > PMML model without the version attribute, there is no use of

Re: Parsing a large XML file using Spark

2015-11-04 Thread Jin
I recently worked on data sources and Parquet a bit in Spark, and someone asked me to make an XML datasource plugin, so I did: https://github.com/HyukjinKwon/spark-xml It tries to get rid of the in-line format, just like the JSON datasource in Spark. Although I didn't add a CI tool for this

Looking for the method executors uses to write to HDFS

2015-11-04 Thread Tóth Zoltán
Hi, I'd like to write a Parquet file from the driver. I could use the HDFS API, but I am worried that it won't work on a secure cluster. I assume that the method the executors use to write to HDFS takes care of managing Hadoop security. However, I can't find the place where the HDFS write happens in

Re: Why some executors are lazy?

2015-11-04 Thread Khaled Ammar
Thank you Adrian. The dataset is indeed skewed. My concern was that some executors do not participate in the computation at all. I understand that executors finish tasks sequentially; therefore, using more executors allows for better parallelism. I managed to force all executors to participate by

Spark driver, Docker, and Mesos

2015-11-04 Thread PHELIPOT, REMY
Hello, I'm trying to run a Spark shell (or a Spark notebook) connected to a Mesos master from inside a Docker container. My goal is to give several users access to the Mesos cluster at the same time. It seems it is not possible to get this working with the container started in bridge mode, is it? Is

Spark 1.5.1 Dynamic Resource Allocation

2015-11-04 Thread tstewart
(apologies if this re-posts, having challenges with the various web front ends to this mailing list) I am running the following command on a Hadoop cluster to launch Spark shell with DRA: spark-shell --conf spark.dynamicAllocation.enabled=true --conf spark.shuffle.service.enabled=true --conf

DataFrame.toJavaRDD cause fetching data to driver, is it expected ?

2015-11-04 Thread Aliaksei Tsyvunchyk
Hello folks, Recently I have noticed unexpectedly high network traffic between the driver program and a worker node. While debugging, I figured out that it is caused by the following block of code (Java): DataFrame etpvRecords = context.sql("SOME SQL query here"); Mapper m = new

Re: Is the resources specified in configuration shared by all jobs?

2015-11-04 Thread Marcelo Vanzin
Resources belong to the application, not each job, so the latter. On Wed, Nov 4, 2015 at 9:24 AM, Nisrina Luthfiyati wrote: > Hi all, > > I'm running some spark jobs in java on top of YARN by submitting one > application jar that starts multiple jobs. > My question

Re: Is the resources specified in configuration shared by all jobs?

2015-11-04 Thread Sandy Ryza
Hi Nisrina, The resources you specify are shared by all jobs that run inside the application. -Sandy On Wed, Nov 4, 2015 at 9:24 AM, Nisrina Luthfiyati < nisrina.luthfiy...@gmail.com> wrote: > Hi all, > > I'm running some spark jobs in java on top of YARN by submitting one > application jar

Re: DataFrame.toJavaRDD cause fetching data to driver, is it expected ?

2015-11-04 Thread Romi Kuntsman
I noticed that toJavaRDD causes a computation on the DataFrame, so is it considered an action, even though logically it's a transformation? On Nov 4, 2015 6:51 PM, "Aliaksei Tsyvunchyk" wrote: > Hello folks, > > Recently I have noticed unexpectedly big network traffic

Re: Spark driver, Docker, and Mesos

2015-11-04 Thread Timothy Chen
Hi Remy, Yes, with the Docker bridge network it's not possible yet; with the host network it should work. I was planning to create a ticket and possibly work on that in the future, as there are some changes needed on the Spark side. Tim > On Nov 4, 2015, at 8:24 AM, PHELIPOT, REMY

Is the resources specified in configuration shared by all jobs?

2015-11-04 Thread Nisrina Luthfiyati
Hi all, I'm running some Spark jobs in Java on top of YARN by submitting one application jar that starts multiple jobs. My question is, if I'm setting some resource configurations, either when submitting the app or in spark-defaults.conf, would these configs apply to each job or to the entire

Running Apache Spark 1.5.1 on console2

2015-11-04 Thread Hitoshi Ozawa
I have Spark 1.5.1 running directly on Windows 7 but would like to run it in Console2. I have JAVA_HOME, SCALA_HOME, and SPARK_HOME set up and have verified Java and Scala are working properly (did a -version and am able to run programs). However, when I try to use Spark via "spark-shell" it returns

Re: PMML version in MLLib

2015-11-04 Thread Fazlan Nazeem
Hi Stefano, Although the intention behind my question wasn't what you expected, what you say makes sense. The standard[1] for PMML 4.1 specifies that "*For PMML 4.1 the attribute version must have the value 4.1*". I'm not sure whether that means that other PMML versions do not need that attribute to

Re: apply simplex method to fix linear programming in spark

2015-11-04 Thread Zhiliang Zhu
Hi Debasish Das, Firstly I must show my deep appreciation for your kind help. Yes, my issue is a typical LP problem. It is as follows: Objective function: f(x1, x2, ..., xn) = a1 * x1 + a2 * x2 + ... + an * xn (n would be some number bigger than 100). There are only 4 constraint functions, x1 + x2

Re: apply simplex method to fix linear programming in spark

2015-11-04 Thread Zhiliang Zhu
Hi Debasish Das, I found a lot of very useful information in your kind reply. However, I am sorry that I still could not exactly follow everything you said. I just know that Spark MLlib uses Breeze as its underlying package; however, I have not practised with it and do not know how to

spark filter function

2015-11-04 Thread Zhiliang Zhu
Hi All, I would like to filter some elements in a given RDD so that only the needed ones are left and the row count of the result RDD is smaller. So I chose the filter function; however, from testing, the filter function only accepts a Boolean-returning predicate, that is to say, will only a JavaRDD be returned for
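A minimal Scala sketch of how filter behaves (the predicate must return Boolean, and the result is an RDD of the same element type, usually with fewer rows); in the Java API the predicate likewise returns Boolean:

    val nums   = sc.parallelize(1 to 10)
    val needed = nums.filter(x => x % 2 == 0)  // keep only the elements we need
    // needed.collect() == Array(2, 4, 6, 8, 10)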

Re: PMML version in MLLib

2015-11-04 Thread Stefano Baghino
I used KNIME, which internally uses the org.dmg.pmml library. On Wed, Nov 4, 2015 at 9:45 AM, Fazlan Nazeem wrote: > Hi Stefano, > > Although the intention for my question wasn't as you expected, what you > say makes sense. The standard[1] for PMML 4.1 specifies that "*For

Re: PMML version in MLLib

2015-11-04 Thread Fazlan Nazeem
I just went through all specifications, and they expect the version attribute. This should be addressed very soon because if we cannot use the PMML model without the version attribute, there is no use of generating one without it. On Wed, Nov 4, 2015 at 2:17 PM, Stefano Baghino <

Re: Codegen In Shuffle

2015-11-04 Thread Reynold Xin
GenerateUnsafeProjection -- projects any internal row data structure directly into bytes (UnsafeRow). On Wed, Nov 4, 2015 at 12:21 AM, 牛兆捷 wrote: > Dear all: > > Tungsten project has mentioned that they are applying code generation is > to speed up the conversion of data

Re: dataframe slow down with tungsten turn on

2015-11-04 Thread gen tang
Yes, the same code, the same result. In fact, the code has been running for more than one month. Before 1.5.0 the performance was quite the same, so I suspect that it is caused by Tungsten. Gen On Wed, Nov 4, 2015 at 4:05 PM, Rick Moritz wrote: > Something to check (just in case):

Re: dataframe slow down with tungsten turn on

2015-11-04 Thread Rick Moritz
Something to check (just in case): Are you getting identical results each time? On Wed, Nov 4, 2015 at 8:54 AM, gen tang wrote: > Hi sparkers, > > I am using dataframe to do some large ETL jobs. > More precisely, I create dataframe from HIVE table and do some operations. >

Dynamic (de)allocation with Spark Streaming

2015-11-04 Thread Wojciech Pituła
Hi, I have some doubts about dynamic resource allocation with Spark Streaming. If Spark has allocated 5 executors for me, then it will dispatch every batch's tasks on all of them equally. So if the batch interval < spark.dynamicAllocation.executorIdleTimeout, then Spark will never free any executor.

Codegen In Shuffle

2015-11-04 Thread 牛兆捷
Dear all: The Tungsten project has mentioned that they are applying code generation to speed up the conversion of data from the in-memory binary format to the wire protocol for shuffle. Where can I find the related implementation in the Spark code base? -- *Regards,* *Zhaojie*

Re: PMML version in MLLib

2015-11-04 Thread Stefano Baghino
Hi Fazlan, I actually had a problem with an invalid PMML produced by Spark 1.5.1 due to the missing "version" attribute in the "PMML" tag. Is this your case too? I've briefly checked the PMML standard and that attribute is required, so this may be an issue that should be addressed. I'll happily

Re: DataFrame.toJavaRDD cause fetching data to driver, is it expected ?

2015-11-04 Thread Aliaksei Tsyvunchyk
Hello Romi, Do you mean that in my particular case I'm causing computation on the DataFrame, or is it the regular behavior of DataFrame.toJavaRDD? If it's regular behavior, do you know which approach could be used to perform map/reduce on a DataFrame without causing it to load all data to the driver

Re: Is the resources specified in configuration shared by all jobs?

2015-11-04 Thread Nisrina Luthfiyati
Got it. Thanks! On Nov 5, 2015 12:32 AM, "Sandy Ryza" wrote: > Hi Nisrina, > > The resources you specify are shared by all jobs that run inside the > application. > > -Sandy > > On Wed, Nov 4, 2015 at 9:24 AM, Nisrina Luthfiyati < > nisrina.luthfiy...@gmail.com> wrote: >

Re: Prevent possible out of memory when using read/union

2015-11-04 Thread Sujit Pal
Hi Alexander, You may want to try the wholeTextFiles() method of SparkContext. Using that you could just do something like this: sc.wholeTextFiles("hdfs://input_dir").saveAsSequenceFile("hdfs://output_dir") wholeTextFiles returns an RDD of (filename, content) pairs.

Re: DataFrame.toJavaRDD cause fetching data to driver, is it expected ?

2015-11-04 Thread Romi Kuntsman
In my program I move between RDD and DataFrame several times. I know that the entire data of the DF doesn't go into the driver because it wouldn't fit there. But calling toJavaRDD does cause computation. Check the number of partitions you have on the DF and RDD... On Nov 4, 2015 7:54 PM,
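A quick way to make that comparison from the Scala shell, assuming `df` stands in for the DataFrame in question:

    println(s"DataFrame partitions: ${df.rdd.partitions.length}")
    println(s"JavaRDD partitions:   ${df.toJavaRDD.rdd.partitions.length}")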

SPARK_SSH_FOREGROUND format

2015-11-04 Thread Kayode Odeyemi
From http://spark.apache.org/docs/latest/spark-standalone.html#cluster-launch-scripts : If you do not have a password-less setup, you can set the environment > variable SPARK_SSH_FOREGROUND and serially provide a password for each > worker. > What does "serially provide a password for each

Re: DataFrame.toJavaRDD cause fetching data to driver, is it expected ?

2015-11-04 Thread Aliaksei Tsyvunchyk
Hi Romi, Thanks for pointing me in the right direction. I'm quite new to Spark and not sure how checking the number of partitions in the DF and RDD will help, so if you can give me some explanation it would be really helpful. A link to documentation would also help. > On Nov 4, 2015, at 1:05 PM, Romi Kuntsman

Executor app-20151104202102-0000 finished with state EXITED

2015-11-04 Thread Kayode Odeyemi
Hi, I can't seem to understand why all created executors always fail. I have a Spark standalone cluster setup made up of 2 workers and 1 master. My spark-env looks like this: SPARK_MASTER_IP=192.168.2.11 SPARK_LOCAL_IP=192.168.2.11 SPARK_MASTER_OPTS="-Dspark.deploy.defaultCores=4"

Re: [Spark MLlib] about linear regression issue

2015-11-04 Thread Zhiliang Zhu
Hi DB Tsai, Firstly I must show my deep appreciation for your kind help. Did you mean that currently there is no way for users to deal with constraints like all weights >= 0 in Spark, though Spark also has LBFGS ... Moreover, I do not know whether Spark's SVD will help at all for that

Prevent possible out of memory when using read/union

2015-11-04 Thread Alexander Lenz
Hi colleagues, In Hadoop I have a lot of folders containing small files. Therefore I am reading the content of all folders, unioning the small files and writing the unioned data into a single folder containing one file. Afterwards I delete the small files and the corresponding folders. I see two