RE: SparkR issues?

2016-03-18 Thread Sun, Rui
Sorry, I was wrong. The issue is not related to as.data.frame(). It seems to be related to a DataFrame naming conflict between S4Vectors and SparkR. Refer to https://issues.apache.org/jira/browse/SPARK-12148 From: Sun, Rui [mailto:rui@intel.com] Sent: Wednesday, March 16, 2016 9:33 AM To: Alex

Re: Limit pyspark.daemon threads

2016-03-18 Thread Carlile, Ken
Thanks! I found that part just after I sent the email… whoops. I’m guessing that’s not an issue for my users, since it’s been set that way for a couple of years now. The thread count is definitely an issue, though, since if enough nodes go down, they can’t schedule their Spark

Handling Missing Values in MLLIB Decision Tree

2016-03-18 Thread Abir Chakraborty
Hello, Can the MLlib Decision Tree (DT) handle missing values with surrogate splits (as is currently done in the "rpart" library in R)? Thanks, Abir Principal Data Scientist, Data Science Group, Innovation Labs [24]7 Inc. - The Intuitive Consumer Experience

RE: best way to do deep learning on spark ?

2016-03-18 Thread Ulanov, Alexander
Hi Charles, There is an implementation of multilayer perceptron in Spark (since 1.5): https://spark.apache.org/docs/latest/ml-classification-regression.html#multilayer-perceptron-classifier Other features such as autoencoder, convolutional layers, etc. are currently under development. Please
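A minimal sketch of that classifier, assuming Spark 1.5+ and DataFrames train/test with the usual "features" and "label" columns (the layer sizes here are illustrative):

    import org.apache.spark.ml.classification.MultilayerPerceptronClassifier

    // Layers: input size, two hidden layers, number of output classes (illustrative)
    val layers = Array(4, 8, 8, 3)

    val mlp = new MultilayerPerceptronClassifier()
      .setLayers(layers)
      .setBlockSize(128)
      .setMaxIter(100)
      .setSeed(1234L)

    val model = mlp.fit(train)               // train: DataFrame("features", "label")
    val predictions = model.transform(test)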

Re: The error to read HDFS custom file in spark.

2016-03-18 Thread Mich Talebzadeh
Hi Tony, Is com.kiisoo.aegis.bd.common.hdfs.RDRawDataRecord one of your own packages? It sounds like it is the one throwing the error. HTH, Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

Re: The error to read HDFS custom file in spark.

2016-03-18 Thread Tony Liu
I also tried that before, but in the RawReader.next(key, value) method, invoking the reader.next method gets an error. It says: Type Mismatch. On Fri, Mar 18, 2016 at 12:53 AM, Benyi Wang wrote: > I would say change > > class RawDataInputFormat[LW <: LongWritable, RD <: RDRawDataRecord]

Saving intermediate results in mapPartitions

2016-03-18 Thread Krishna
Hi, I've a situation where the number of elements output by each partition from mapPartitions doesn't fit into RAM, even with the lowest number of rows per partition (there is a hard lower limit on this value). What's the best way to address this problem? During the mapPartitions phase, is
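For context, the usual way to keep mapPartitions from materializing a partition's output is to return a lazy iterator rather than collecting into a buffer first; a sketch, assuming an RDD[String] named rdd:

    // iter.map(...) is lazy: the partition streams element by element, so only a
    // handful of records are in memory at once. Collecting into an ArrayBuffer
    // before returning would hold the entire partition's output in RAM.
    val result = rdd.mapPartitions { iter =>
      iter.map(record => record.toUpperCase)   // illustrative per-element transform
    }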

Re: [discuss] making SparkEnv private in Spark 2.0

2016-03-18 Thread Mridul Muralidharan
We have custom joins that leverage it. It is used to get at the direct shuffled iterator, without needing sort/aggregate/etc. IIRC the only way to get to it from a ShuffleHandle is via the shuffle manager. Regards, Mridul On Wed, Mar 16, 2016 at 3:36 PM, Reynold Xin wrote: > >
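A hedged sketch of the kind of access being described, using Spark 1.x internals: ShuffleManager is private[spark], so code like this only compiles inside the org.apache.spark package tree, which is exactly the coupling under discussion.

    import org.apache.spark.{SparkEnv, TaskContext}
    import org.apache.spark.shuffle.ShuffleHandle

    // Read a range of reduce partitions directly from the shuffle, skipping
    // sort/aggregation. `handle` comes from the join's ShuffleDependency.
    def directShuffleIterator[K, C](handle: ShuffleHandle,
                                    start: Int, end: Int): Iterator[Product2[K, C]] = {
      val reader = SparkEnv.get.shuffleManager
        .getReader[K, C](handle, start, end, TaskContext.get())
      reader.read()
    }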

Limit pyspark.daemon threads

2016-03-18 Thread Carlile, Ken
Hello, We have an HPC cluster that we run Spark jobs on using standalone mode and a number of scripts I’ve built up to dynamically schedule and start Spark clusters within the Grid Engine framework. Nodes in the cluster have 16 cores and 128GB of RAM. My users use pyspark heavily. We’ve

Fwd: DF creation

2016-03-18 Thread satyajit vegesna
Hi, I am trying to create a separate val reference to object DATA (as shown below): case class data(name:String, age:String) Creation of this object is done separately, and the reference to the object is stored into val data. I use val samplerdd = sc.parallelize(Seq(data)) to create the RDD.
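For reference, the usual pattern parallelizes an instance of the case class rather than the class itself (Seq(data) passes the companion object, which is not what parallelize expects); a sketch using the post's names:

    case class data(name: String, age: String)

    val d = data("alice", "30")              // an instance, not the companion object
    val samplerdd = sc.parallelize(Seq(d))   // RDD[data]

    import sqlContext.implicits._            // Spark 1.x; spark.implicits._ in 2.x
    val df = samplerdd.toDF()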

Request for comments: Tensorframes, an integration library between TensorFlow and Spark DataFrames

2016-03-18 Thread Tim Hunter
Hello all, I would like to bring your attention to a small project to integrate TensorFlow with Apache Spark, called TensorFrames. With this library, you can map, reduce or aggregate numerical data stored in Spark dataframes using TensorFlow computation graphs. It is published as a Spark package

Re: SparkContext.stop() takes too long to complete

2016-03-18 Thread Nezih Yigitbasi
Hadoop 2.4.0. Here are the relevant logs from executor 1136:
16/03/18 21:26:58 INFO mapred.SparkHadoopMapRedUtil: attempt_201603182126_0276_m_000484_0: Committed
16/03/18 21:26:58 INFO executor.Executor: Finished task 484.0 in stage 276.0 (TID 59663). 1080 bytes result sent to driver
16/03/18

Re: SparkContext.stop() takes too long to complete

2016-03-18 Thread Ted Yu
Which version of Hadoop do you use? bq. Requesting to kill executor(s) 1136 Can you find more information on executor 1136? Thanks On Fri, Mar 18, 2016 at 4:16 PM, Nezih Yigitbasi < nyigitb...@netflix.com.invalid> wrote: > Hi Spark experts, > I am using Spark 1.5.2 on YARN with dynamic

Saving the DataFrame based RandomForestClassificationModels

2016-03-18 Thread James Hammerton
Hi, If you train an org.apache.spark.ml.classification.RandomForestClassificationModel, you can't save it - attempts to do so yield the following error: 16/03/18 14:12:44 INFO SparkContext: Successfully stopped SparkContext > Exception in thread "main" java.lang.UnsupportedOperationException: >
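Until save/load lands for the DataFrame-based models, one workaround sometimes used is plain Java serialization; a hedged sketch that assumes the model's object graph is fully serializable, which is not guaranteed across Spark versions:

    import java.io._

    // Serialize any Serializable object graph to a local file
    def saveObject(obj: AnyRef, path: String): Unit = {
      val oos = new ObjectOutputStream(new FileOutputStream(path))
      try oos.writeObject(obj) finally oos.close()
    }

    def loadObject[T](path: String): T = {
      val ois = new ObjectInputStream(new FileInputStream(path))
      try ois.readObject().asInstanceOf[T] finally ois.close()
    }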

SparkContext.stop() takes too long to complete

2016-03-18 Thread Nezih Yigitbasi
Hi Spark experts, I am using Spark 1.5.2 on YARN with dynamic allocation enabled. I see in the driver/application master logs that the app is marked as SUCCEEDED and then SparkContext stop is called. However, this stop sequence takes > 10 minutes to complete, and YARN resource manager kills the
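For reference, a typical dynamic-allocation setup on YARN looks roughly like this (values are illustrative):

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .set("spark.dynamicAllocation.enabled", "true")
      .set("spark.shuffle.service.enabled", "true")       // required by dynamic allocation
      .set("spark.dynamicAllocation.minExecutors", "2")
      .set("spark.dynamicAllocation.maxExecutors", "200")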

Re: Enabling spark_shuffle service without restarting YARN Node Manager

2016-03-18 Thread Vinay Kashyap
Thanks for your reply, guys. @Alex: Hope in a future release we get a way to do this. @Saisai: The concern regarding the Node Manager restart is that, in a shared YARN cluster running other applications as well as Spark, enabling the Spark shuffle service means the other running

The error to read HDFS custom file in spark.

2016-03-18 Thread Tony Liu
Hi, My HDFS file is stored with custom data structures. I want to read it with a SparkContext object, so I define a formatting object: *1. code of RawDataInputFormat.scala* import com.kiisoo.aegis.bd.common.hdfs.RDRawDataRecord import org.apache.hadoop.io.LongWritable import
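For reference, a custom format like this is normally consumed along the following lines (class names taken from the post; this assumes the old org.apache.hadoop.mapred API and a RawDataInputFormat declared without type parameters; use sc.newAPIHadoopFile for org.apache.hadoop.mapreduce formats):

    import org.apache.hadoop.io.LongWritable

    // Read HDFS files through the custom input format into an RDD of (key, value).
    // Hadoop RecordReaders may reuse objects, so copy records before caching them.
    val raw = sc.hadoopFile[LongWritable, RDRawDataRecord, RawDataInputFormat](
      "hdfs:///path/to/data")                // illustrative path
    val records = raw.map { case (_, v) => v }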

Re: DistributedLDAModel missing APIs in org.apache.spark.ml

2016-03-18 Thread Ted Yu
Can you utilize this function of DistributedLDAModel? override protected def getModel: OldLDAModel = oldDistributedModel Cheers On Fri, Mar 18, 2016 at 7:34 AM, cindymc wrote: > I like using the new DataFrame APIs in Spark ML, compared to using RDDs in > the

Re: The build-in indexes in ORC file does not work.

2016-03-18 Thread Jörn Franke
Not sure it should work. How many rows are affected? Is the data sorted? Have you tried with Tez? Tez has summary statistics that tell you whether push-down is used. Maybe you need to use HiveContext. Perhaps a bloom filter could make sense for you as well. > On 16 Mar 2016, at 12:45, Joseph
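If push-down is the question, note that ORC filter push-down is off by default in Spark 1.x and has to be enabled explicitly; a sketch, assuming a HiveContext named sqlContext and an illustrative path and column:

    // spark.sql.orc.filterPushdown defaults to false in Spark 1.x
    sqlContext.setConf("spark.sql.orc.filterPushdown", "true")

    val df = sqlContext.read.orc("/path/to/table")
    df.filter(df("id") === 42).show()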

Can't zip RDDs with unequal numbers of partitions

2016-03-18 Thread Jiří Syrový
Hi, any idea what could be causing this issue? It started appearing after changing the parameter spark.sql.autoBroadcastJoinThreshold to 10. Caused by: java.lang.IllegalArgumentException: Can't zip RDDs with unequal numbers of partitions at
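As background, zip requires both RDDs to have the same number of partitions and the same number of elements per partition; when that cannot be guaranteed, a common workaround is to index both sides and join, sketched here:

    // zipWithIndex assigns stable indices, so the join tolerates different partitioning
    val indexedA = rddA.zipWithIndex().map(_.swap)   // RDD[(Long, A)]
    val indexedB = rddB.zipWithIndex().map(_.swap)
    val zipped   = indexedA.join(indexedB).values    // RDD[(A, B)]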

Incomplete data when reading from S3

2016-03-18 Thread Blaž Šnuderl
Hi. We have json data stored in S3 (json record per line). When reading the data from s3 using the following code we started noticing json decode errors. sc.textFile(paths).map(json.loads) After a bit more investigation we noticed an incomplete line, basically the line was > {"key": "value",
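One way to surface such truncated lines instead of failing mid-job is the DataFrame JSON reader, which parks unparseable lines in a _corrupt_record column (the Spark 1.x default name; the column only exists when corrupt lines were seen). A sketch, with paths as in the post:

    val df = sqlContext.read.json(paths)

    // Lines that failed to parse land here instead of raising a decode error
    val corrupt = df.filter(df("_corrupt_record").isNotNull)
    corrupt.show()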

Re: Limit pyspark.daemon threads

2016-03-18 Thread Sea
It's useless... The python worker will go above 1.5g in my production environment. ------ Original Message ------ From: "Ted Yu"; Sent: Thursday, March 17, 2016, 10:50; To: "Carlile, Ken";

Re: Saving intermediate results in mapPartitions

2016-03-18 Thread Enrico Rotundo
Try setting MEMORY_AND_DISK as the RDD’s storage persistence level. http://spark.apache.org/docs/latest/programming-guide.html#rdd-persistence > On 19 Mar 2016, at 00:55, Krishna wrote: > > Hi, > >
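That is, something along these lines:

    import org.apache.spark.storage.StorageLevel

    // Partitions that don't fit in memory spill to disk instead of being
    // recomputed from scratch
    rdd.persist(StorageLevel.MEMORY_AND_DISK)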

spark shuffle service on yarn

2016-03-18 Thread Koert Kuipers
Spark on YARN is nice because I can bring my own Spark. I am worried that the shuffle service forces me to use some "sanctioned" Spark version that is officially "installed" on the cluster. So... can I safely install the Spark 1.3 shuffle service on YARN and use it with other 1.x versions of
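For reference, the application-side settings that point a job at an externally installed shuffle service look like this (the NodeManager-side aux-service registration lives separately in yarn-site.xml):

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .set("spark.shuffle.service.enabled", "true")
      .set("spark.shuffle.service.port", "7337")   // default; must match the NM side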

Potential conflict with org.iq80.snappy in Spark 1.6.0 environment?

2016-03-18 Thread vasu20
Hi, I have some code that parses a snappy thrift file for objects. This code works fine when run standalone (outside of the Spark environment). However, when running from within Spark, I get an IllegalAccessError exception from the org.iq80.snappy package. Has anyone else seen this error
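If the clash is between an application-bundled snappy and the version on Spark's classpath, one thing worth trying is giving user jars precedence; hedged, since these flags are experimental in 1.x and can introduce new conflicts of their own:

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .set("spark.driver.userClassPathFirst", "true")
      .set("spark.executor.userClassPathFirst", "true")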