Re: off-heap certain operations

2016-02-11 Thread Sea
spark.memory.offHeap.enabled defaults to false; the Spark docs are wrong on this point. Spark 1.6 does not recommend using off-heap memory. ------------------ Original message ------------------ From: "Ovidiu-Cristian MARCU"; Date: 2016-02-12 (Fri) 5:51; To: "user"; Subject: off-heap certain oper

mllib:Survival Analysis : assertion failed: AFTAggregator loss sum is infinity. Error for unknown reason.

2016-02-11 Thread Stuti Awasthi
Hi All, I wanted to try Survival Analysis on Spark 1.6. I am successfully able to run the AFT example provided. Now I tried to train the model with the Ovarian data, which is the standard dataset that comes with the survival library in R. Default column names: Futime,fustat,age,resid_ds,rx,ecog_ps Here are the ste
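
A minimal sketch (not from the thread) of fitting the Spark 1.6 ml AFTSurvivalRegression on Ovarian-style rows; the rows, the feature encoding, and the column mapping (futime -> label, fustat -> censor) are illustrative assumptions:

    import org.apache.spark.ml.regression.AFTSurvivalRegression
    import org.apache.spark.mllib.linalg.Vectors

    // Made-up rows standing in for the Ovarian data: (futime, fustat, [age, resid_ds, rx, ecog_ps])
    val training = sqlContext.createDataFrame(Seq(
      (59.0, 1.0, Vectors.dense(72.0, 2.0, 1.0, 1.0)),
      (115.0, 1.0, Vectors.dense(74.0, 2.0, 1.0, 1.0)),
      (156.0, 1.0, Vectors.dense(66.0, 2.0, 1.0, 2.0))
    )).toDF("label", "censor", "features")

    val aft = new AFTSurvivalRegression()
      .setLabelCol("label")       // survival time; must be positive
      .setCensorCol("censor")     // 1.0 = event observed, 0.0 = censored
      .setFeaturesCol("features")

    val model = aft.fit(training)
    model.transform(training).show()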

Re: Spark Streaming with Kafka: Dealing with 'slow' partitions

2016-02-11 Thread Sebastian Piu
Have you tried using the fair scheduler and queues? On 12 Feb 2016 4:24 a.m., "p pathiyil" wrote: > With this setting, I can see that the next job is being executed before > the previous one is finished. However, the processing of the 'hot' > partition eventually hogs all the concurrent jobs. If there
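
A minimal sketch, assuming Spark 1.5/1.6, of the settings being suggested; the pool name and the concurrent-jobs count are illustrative, not values from the thread:

    // Fair scheduling plus a second concurrent batch, so a slow 'hot' partition
    // does not block every other job.
    val conf = new org.apache.spark.SparkConf()
      .setAppName("streaming-fair-scheduler-sketch")
      .set("spark.scheduler.mode", "FAIR")            // FIFO is the default
      .set("spark.streaming.concurrentJobs", "2")     // let the next batch start early

    val sc = new org.apache.spark.SparkContext(conf)
    // Jobs submitted from this thread go into a pool defined in fairscheduler.xml
    sc.setLocalProperty("spark.scheduler.pool", "hot_partition_pool")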

Re: SparkSQL parallelism

2016-02-11 Thread Rishi Mishra
I am not sure why all 3 nodes would query. If you have not specified any partitioning options, there will be only one partition in the JDBCRDD, and the whole dataset will reside in it. On Fri, Feb 12, 2016 at 10:15 AM, Madabhattula Rajesh Kumar < mrajaf...@gmail.com> wrote: > Hi, > > I have a spark cluster with One Ma
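
For reference, a hedged sketch of the partitioned JDBC read being alluded to; the URL, table, credentials and column bounds are placeholders, not values from the thread:

    import java.util.Properties

    val props = new Properties()
    props.setProperty("user", "scott")        // placeholder credentials
    props.setProperty("password", "tiger")

    // With partitionColumn/lowerBound/upperBound/numPartitions Spark issues several
    // range queries, so the read is spread across the workers instead of one task.
    val employees = sqlContext.read.jdbc(
      "jdbc:oracle:thin:@//dbhost:1525/service",   // placeholder URL
      "EMPLOYEES",                                 // placeholder table
      "EMPLOYEE_ID",                               // numeric column to partition on
      1L, 100000L,                                 // bounds of that column
      4,                                           // number of partitions / parallel queries
      props)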

Re: Scala types to StructType

2016-02-11 Thread Yogesh Mahajan
Right, Thanks Ted. On Fri, Feb 12, 2016 at 10:21 AM, Ted Yu wrote: > Minor correction: the class is CatalystTypeConverters.scala > > On Thu, Feb 11, 2016 at 8:46 PM, Yogesh Mahajan > wrote: > >> CatatlystTypeConverters.scala has all types of utility methods to convert >> from Scala to row and v

Re: Scala types to StructType

2016-02-11 Thread Ted Yu
Minor correction: the class is CatalystTypeConverters.scala On Thu, Feb 11, 2016 at 8:46 PM, Yogesh Mahajan wrote: > CatatlystTypeConverters.scala has all types of utility methods to convert > from Scala to row and vice a versa. > > > On Fri, Feb 12, 2016 at 12:21 AM, Rishabh Wadhawan > wrote:

Re: Scala types to StructType

2016-02-11 Thread Yogesh Mahajan
CatatlystTypeConverters.scala has all types of utility methods to convert from Scala to row and vice a versa. On Fri, Feb 12, 2016 at 12:21 AM, Rishabh Wadhawan wrote: > I had the same issue. I resolved it in Java, but I am pretty sure it would > work with scala too. Its kind of a gross hack. Bu

SparkSQL parallelism

2016-02-11 Thread Madabhattula Rajesh Kumar
Hi, I have a spark cluster with one Master and 3 worker nodes. I have written the below code to fetch the records from Oracle using SparkSQL val sqlContext = new org.apache.spark.sql.SQLContext(sc) val employees = sqlContext.read.format("jdbc").options( Map("url" -> "jdbc:oracle:thin:@:1525

Re: Spark Streaming with Kafka: Dealing with 'slow' partitions

2016-02-11 Thread p pathiyil
With this setting, I can see that the next job is being executed before the previous one is finished. However, the processing of the 'hot' partition eventually hogs all the concurrent jobs. If there was a way to restrict jobs to be one per partition, then this setting would provide the per-partitio

Re: Building Spark with a Custom Version of Hadoop: HDFS ClassNotFoundException

2016-02-11 Thread Ted Yu
The Spark driver does not run on the YARN cluster in client mode, only the Spark executors do. Can you check YARN logs for the failed job to see if there was more clue ? Does the YARN cluster run the customized hadoop or stock hadoop ? Cheers On Thu, Feb 11, 2016 at 5:44 PM, Charlie Wright wrot

Re: Building Spark with a Custom Version of Hadoop: HDFS ClassNotFoundException

2016-02-11 Thread Ted Yu
I think SPARK_CLASSPATH is deprecated. Can you show the command line launching your Spark job ? Which Spark release do you use ? Thanks On Thu, Feb 11, 2016 at 5:38 PM, Charlie Wright wrote: > built and installed hadoop with: > mvn package -Pdist -DskipTests -Dtar > mvn install -DskipTests >

Re: cache DataFrame

2016-02-11 Thread Gaurav Agarwal
Thanks for the below info. I have one more question. I have my own framework where the SQL query is already built, so I am thinking that instead of using the DataFrame filter criteria I could use DataFrame d = sqlContext.sql("<append query here>"); d.printSchema(); List<Row> rows = d.collectAsList(); Here when I

Spark Summit San Francisco 2016 call for presentations (CFP)

2016-02-11 Thread Reynold Xin
FYI, Call for presentations is now open for Spark Summit. The event will take place on June 6-8 in San Francisco. Submissions are welcome across a variety of Spark-related topics, including applications, development, data science, business value, spark ecosystem and research. Please submit by Febr

Testing email please ignore

2016-02-11 Thread Mich Talebzadeh
-- Dr Mich Talebzadeh LinkedIn https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw http://talebzadehmich.wordpress.com NOTE: The information in this email is proprietary and confidential. This message is for the designated recipient only, if you are not the inte

Question on Spark architecture and DAG

2016-02-11 Thread Mich Talebzadeh
Hi, I have used Hive on the Spark engine (with Hive tables, of course) and it is pretty impressive compared with Hive on the MR engine. Let us assume that I use the spark shell. The spark shell is a client that connects to the spark master running on a host and port like below spark-shell --master spark://50.140.197.21

Re: Stateful Operation on JavaPairDStream Help Needed !!

2016-02-11 Thread Sebastian Piu
Looks like mapWithState could help you? On 11 Feb 2016 8:40 p.m., "Abhishek Anand" wrote: > Hi All, > > I have an use case like follows in my production environment where I am > listening from kafka with slideInterval of 1 min and windowLength of 2 > hours. > > I have a JavaPairDStream where for
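
A hedged Scala sketch of mapWithState (Spark 1.6); the socket source, key/value types and the running-sum logic are stand-ins for the Kafka windowed job in the question:

    import org.apache.spark.streaming.{Seconds, State, StateSpec, StreamingContext}
    import org.apache.spark.streaming.dstream.DStream

    val ssc = new StreamingContext(sc, Seconds(60))

    // Stand-in for the keyed DStream coming out of Kafka in the original question.
    val pairs: DStream[(String, Long)] =
      ssc.socketTextStream("localhost", 9999).map(line => (line, 1L))

    // Keep a running total per key across batches.
    def updateFn(key: String, value: Option[Long], state: State[Long]): (String, Long) = {
      val total = state.getOption.getOrElse(0L) + value.getOrElse(0L)
      state.update(total)          // persist the new per-key state
      (key, total)                 // record emitted for this batch
    }

    val stateful = pairs.mapWithState(StateSpec.function(updateFn _))
    stateful.print()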

Re: Skip empty batches - spark streaming

2016-02-11 Thread Sebastian Piu
Thanks for clarifying Cody. I will extend the current behaviour for my use case. If there is anything worth sharing I'll run it through the list Cheers On 11 Feb 2016 9:47 p.m., "Cody Koeninger" wrote: > Please don't change the behavior of DirectKafkaInputDStream. > Returning an empty rdd is (im

off-heap certain operations

2016-02-11 Thread Ovidiu-Cristian MARCU
Hi, Reading through the latest documentation for Memory management I can see that the parameter spark.memory.offHeap.enabled (true by default) is described with ‘If true, Spark will attempt to use off-heap memory for certain operations’ [1]. Can you please describe the certain operations you are
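
For context, the two related settings in Spark 1.6, as a minimal sketch; the size value is an arbitrary example, not a recommendation:

    // Off-heap storage for the unified memory manager; size is in bytes per executor.
    val conf = new org.apache.spark.SparkConf()
      .set("spark.memory.offHeap.enabled", "true")
      .set("spark.memory.offHeap.size", "2147483648")   // 2 GB, illustrative only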

AmpLab Big Data Benchmark for Spark error on EC2

2016-02-11 Thread cheez
I am trying to run the Big Data benchmark on my EC2 cluster for my own Spark fork of version 1.5. It just modifies some files on the Spark core. My cluster contains 1 master and 2 slave nodes of type m1.large. I use the ec2 scripts bundled with Spark t

Re: Skip empty batches - spark streaming

2016-02-11 Thread Cody Koeninger
Please don't change the behavior of DirectKafkaInputDStream. Returning an empty rdd is (imho) the semantically correct thing to do, and some existing jobs depend on that behavior. If it's really an issue for you, you can either override directkafkainputdstream, or just check isEmpty as the first t
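
A minimal sketch of the isEmpty check being suggested; 'stream' and the output path stand in for whatever the real job does:

    stream.foreachRDD { rdd =>
      if (!rdd.isEmpty()) {
        // only pay for the output action when the batch actually has records
        rdd.saveAsTextFile(s"/tmp/batch-${System.currentTimeMillis}")   // placeholder sink
      }
    }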

Re: Skip empty batches - spark streaming

2016-02-11 Thread Andy Davidson
You can always call rdd.isEmpty() Andy private static void save(JavaDStream<String> jsonRdd, String outputURI) { jsonTweets.foreachRDD(new VoidFunction2<JavaRDD<String>, Time>() { private static final long serialVersionUID = 1L; @Override public void call(JavaRDD rd

Re: Skip empty batches - spark streaming

2016-02-11 Thread Sebastian Piu
Yes, and as far as I recall it also has partitions (empty) which screws up the isEmpty call if the rdd has been transformed down the line. I will have a look tomorrow at the office and see if I can collaborate On 11 Feb 2016 9:14 p.m., "Shixiong(Ryan) Zhu" wrote: > Yeah, DirectKafkaInputDStream a

newbie unable to write to S3 403 forbidden error

2016-02-11 Thread Andy Davidson
I am using spark 1.6.0 in a cluster created using the spark-ec2 script. I am using the standalone cluster manager. My java streaming app is not able to write to s3. It appears to be some form of permission problem. Any idea what the problem might be? I tried to use the IAM simulator to test the polic

Re: Skip empty batches - spark streaming

2016-02-11 Thread Shixiong(Ryan) Zhu
Yeah, DirectKafkaInputDStream always returns a RDD even if it's empty. Feel free to send a PR to improve it. On Thu, Feb 11, 2016 at 1:09 PM, Sebastian Piu wrote: > I'm using the Kafka direct stream api but I can have a look on extending > it to have this behaviour > > Thanks! > On 11 Feb 2016 9

Re: Computing hamming distance over large data set

2016-02-11 Thread Brian Morton
Karl, This is tremendously useful. Thanks very much for your insight. Brian On Thu, Feb 11, 2016 at 12:58 PM, Karl Higley wrote: > Hi, > > It sounds like you're trying to solve the approximate nearest neighbor > (ANN) problem. With a large dataset, parallelizing a brute force O(n^2) > approac

Re: Skip empty batches - spark streaming

2016-02-11 Thread Sebastian Piu
I'm using the Kafka direct stream api but I can have a look on extending it to have this behaviour Thanks! On 11 Feb 2016 9:07 p.m., "Shixiong(Ryan) Zhu" wrote: > Are you using a custom input dstream? If so, you can make the `compute` > method return None to skip a batch. > > On Thu, Feb 11, 201

Re: Spark workers disconnecting on 1.5.2

2016-02-11 Thread Darren Govoni
Myself and others on here encounter situations where tasks get stuck on the last stage and the entire job just hangs. No exceptions or nothing. Most likely a deadlock somewhere inside spark. Sent from my Verizon Wireless 4G LTE smartphone Original message From: Andy Max

Re: Skip empty batches - spark streaming

2016-02-11 Thread Shixiong(Ryan) Zhu
Are you using a custom input dstream? If so, you can make the `compute` method return None to skip a batch. On Thu, Feb 11, 2016 at 1:03 PM, Sebastian Piu wrote: > I was wondering if there is there any way to skip batches with zero events > when streaming? > By skip I mean avoid the empty rdd fr

Skip empty batches - spark streaming

2016-02-11 Thread Sebastian Piu
I was wondering if there is any way to skip batches with zero events when streaming? By skip I mean avoid the empty rdd from being created at all?

Stateful Operation on JavaPairDStream Help Needed !!

2016-02-11 Thread Abhishek Anand
Hi All, I have a use case like the following in my production environment, where I am listening from kafka with slideInterval of 1 min and windowLength of 2 hours. I have a JavaPairDStream where for each key I am getting the same key but with a different value, which might appear in the same batch or some

Re: Spark History Server pointing to S3

2016-02-11 Thread Vladimir Grigor
Gianluca, have you had any luck with S3 storage of logs for history server? Best, Vladimir On Tue, Jun 16, 2015 at 6:53 PM, Gianluca Privitera < gianluca.privite...@studio.unibo.it> wrote: > It gives me an exception with > org.apache.spark.deploy.history.FsHistoryProvider , a problem with the

Re: Spark workers disconnecting on 1.5.2

2016-02-11 Thread Andy Max
No, ours are running on Docker containers spread across a few physical servers. Databricks runs their service on AWS. Wonder if they are seeing this issue. What other problems are you seeing with 1.5.2? We are seeing some executors get stuck in loading state. On Thu, Feb 11, 2016 at 11:16 AM, Da

Re: How to parallel read files in a directory

2016-02-11 Thread Jakob Odersky
Hi Junjie, How do you access the files currently? Have you considered using hdfs? It's designed to be distributed across a cluster and Spark has built-in support. Best, --Jakob On Feb 11, 2016 9:33 AM, "Junjie Qian" wrote: > Hi all, > > I am working with Spark 1.6, scala and have a big dataset
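
A hedged sketch of the usual way to read a directory of small files in parallel; the paths and partition count are placeholders:

    // Each file/block becomes one or more partitions, so all workers read concurrently.
    val lines = sc.textFile("hdfs:///data/mydataset/*", 200)   // second arg = minimum partitions
    println(lines.partitions.length)

    // If each file must be kept whole, wholeTextFiles returns (path, content) pairs instead.
    val files = sc.wholeTextFiles("hdfs:///data/mydataset")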

RE: Spark workers disconnecting on 1.5.2

2016-02-11 Thread Darren Govoni
I see this too. Might explain some other serious problems we're having with 1.5.2 Is your cluster in AWS? Sent from my Verizon Wireless 4G LTE smartphone Original message From: Andy Max Date: 02/11/2016 2:12 PM (GMT-05:00) To: user@spark.apache.org Subject: Spark w

Spark workers disconnecting on 1.5.2

2016-02-11 Thread Andy Max
I launched a 4 node Spark 1.5.2 cluster. No activity for a day or so. Now noticed that few of the workers are disconnected. Don't see this issue on Spark 1.4 or Spark 1.3. Would appreciate any pointers. Thx

best practices? spark streaming writing output detecting disk full error

2016-02-11 Thread Andy Davidson
We recently started a Spark/Spark Streaming POC. We wrote a simple streaming app in java to collect tweets. We chose twitter because we knew we would get a lot of data and probably lots of bursts. Good for stress testing. We spun up a couple of small clusters using the spark-ec2 script. In one cluster we

Re: Passing a dataframe to where clause + Spark SQL

2016-02-11 Thread Rishabh Wadhawan
Hi Divya Considering you are able to successfully load both tables testCond and test as data frames, now taking your case: when you do val condval = testCond.select(“Cond”) // where Cond is a column name, here condval is a DataFrame. Even if it has one row, it is still a data frame if you want
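
A minimal sketch, under the assumption that "Cond" holds a single string value, of the two options described here (pull the value out, or keep it as a join); the column name someColumn is a placeholder:

    import org.apache.spark.sql.functions.col

    // Option 1: materialise the single condition value and use it as a literal.
    val condValue = testCond.select("Cond").first().getString(0)
    val filtered = test.filter(col("someColumn") === condValue)

    // Option 2: stay distributed and express the condition as a join.
    val joined = test.join(testCond, test("someColumn") === testCond("Cond"))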

Re: Kafka + Spark 1.3 Integration

2016-02-11 Thread Cody Koeninger
That's what the kafkaParams argument is for. Not all of the kafka configuration parameters will be relevant, though. On Thu, Feb 11, 2016 at 12:07 PM, Nipun Arora wrote: > Hi , > > Thanks for the explanation and the example link. Got it working. > A follow up question. In Kafka one can define p
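
A hedged sketch of where those properties go with the direct stream API (Spark 1.3+); broker list, group id and topic are placeholders:

    import kafka.serializer.StringDecoder
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka.KafkaUtils

    val ssc = new StreamingContext(sc, Seconds(10))

    // Kafka consumer properties are passed as the kafkaParams map;
    // only the parameters relevant to the direct consumer are honoured.
    val kafkaParams = Map[String, String](
      "metadata.broker.list" -> "broker1:9092,broker2:9092",
      "group.id"             -> "my-consumer-group",
      "auto.offset.reset"    -> "smallest")

    val topics = Set("my_topic")
    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, topics)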

Re: Scala types to StructType

2016-02-11 Thread Rishabh Wadhawan
I had the same issue. I resolved it in Java, but I am pretty sure it would work with scala too. It's kind of a gross hack. But what I did is, say I had a table in MySQL with 1000 columns: I issued a jdbc query to extract the schema of the table. I stored that schema and wrote a

Re: Spark Certification

2016-02-11 Thread Prem Sure
I did it recently. It includes MLlib & GraphX too, and I felt like the exam content covered all topics up to 1.3 and not the > 1.3 versions of spark. On Thu, Feb 11, 2016 at 9:39 AM, Janardhan Karri wrote: > I am planning to do that with databricks > http://go.databricks.com/spark-certified-developer > >

Re: cache DataFrame

2016-02-11 Thread Rishabh Wadhawan
Hi Gaurav Spark will not load the tables into memory at either point, as DataFrames are just abstractions of something that might happen in the future, when you actually invoke an ACTION like df.collectAsList or df.show. When you run DataFrame df = sContext.load("jdbc","(select * from employe
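
A short sketch of that lazy-evaluation point; the JDBC connection details are placeholders:

    // load() returns immediately: only the schema is resolved, no rows are read.
    val df = sqlContext.read.format("jdbc")
      .options(Map("url" -> "jdbc:mysql://dbhost/test", "dbtable" -> "employee"))
      .load()

    df.cache()           // also lazy: merely marks the DataFrame for caching

    val n = df.count()   // first action: rows are read from the database and cached
    df.show(5)           // served from the cache, no second scan of the table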

Re: LogisticRegressionModel not able to load serialized model from S3

2016-02-11 Thread Utkarsh Sengar
The problem turned out to be corrupt parquet data, the error was a bit misleading by spark though. On Mon, Feb 8, 2016 at 3:41 PM, Utkarsh Sengar wrote: > I am storing a model in s3 in this path: > "bucket_name/p1/models/lr/20160204_0410PM/ser" and the structure of the > saved dir looks like thi

ApacheCon NA 2016 - Important Dates!!!

2016-02-11 Thread Melissa Warnkin
Hello everyone! I hope this email finds you well.  I hope everyone is as excited about ApacheCon as I am! I'd like to remind you all of a couple of important dates, as well as ask for your assistance in spreading the word! Please use your social media platform(s) to get the word out! The more v

cache DataFrame

2016-02-11 Thread Gaurav Agarwal
Hi When will the dataFrame load the table into memory when it reads from Hive/Phoenix or from any database? These are the two points where I need one piece of info: will the tables be loaded into memory or cached at point 1 or point 2 below? 1. DataFrame df = sContext.load("jdbc","(select * from employ

Re: spark thrift server transport protocol

2016-02-11 Thread Ted Yu
From the head of HiveThriftServer2 : * The main entry point for the Spark SQL port of HiveServer2. Starts up a `SparkSQLContext` and a * `HiveThriftServer2` thrift server. Looking at HiveServer2.java from Hive, looks like it uses the thrift protocol. FYI On Thu, Feb 11, 2016 at 9:34 AM, Sanjeev

Re: Spark execuotr Memory profiling

2016-02-11 Thread Rishabh Wadhawan
Hi All Please check this jira ticket regarding the issue. I was having the same issue with shuffling. Seems like the shuffling memory max is 2g. https://issues.apache.org/jira/browse/SPARK-5928 > On Feb 11, 2016, at 9:08 AM, arun.bong...@cogniza

Re: Kafka + Spark 1.3 Integration

2016-02-11 Thread Nipun Arora
Hi , Thanks for the explanation and the example link. Got it working. A follow up question. In Kafka one can define properties as follows: Properties props = new Properties(); props.put("zookeeper.connect", zookeeper); props.put("group.id", groupId); props.put("zookeeper.session.timeout.ms", "500

Re: Dataframes

2016-02-11 Thread Rishabh Wadhawan
Hi Gaurav I am not sure what you are trying to do here as you are naming two data frames with the same name which would be a compilation error in java. However, after trying to see what you are asking, as of what I understand your question is. You can do something like this; > SqlContext sqlCo

Re: Computing hamming distance over large data set

2016-02-11 Thread Karl Higley
Hi, It sounds like you're trying to solve the approximate nearest neighbor (ANN) problem. With a large dataset, parallelizing a brute force O(n^2) approach isn't likely to help all that much, because the number of pairwise comparisons grows quickly as the size of the dataset increases. I'd look at
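
As a small aside, a sketch of the Hamming-distance primitive any bucketing/LSH scheme over binary signatures would reuse; packing the signatures into Array[Long] (64 bits per element) is an assumption:

    // Hamming distance between two equally wide bit-packed signatures.
    def hamming(a: Array[Long], b: Array[Long]): Int = {
      require(a.length == b.length, "signatures must have the same width")
      var i = 0
      var dist = 0
      while (i < a.length) {
        dist += java.lang.Long.bitCount(a(i) ^ b(i))   // differing bits in this 64-bit word
        i += 1
      }
      dist
    }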

spark thrift server transport protocol

2016-02-11 Thread Sanjeev Verma
I am running the spark thrift server to run queries over hive; how can I know which transport protocol my client and server are currently using? Thanks

How to parallel read files in a directory

2016-02-11 Thread Junjie Qian
Hi all, I am working with Spark 1.6, scala, and have a big dataset divided into several small files. My question is: right now the read operation takes a really long time and often has RDD warnings. Is there a way I can read the files in parallel, so that all nodes or workers read the file at the same

Re: Dataframes

2016-02-11 Thread Ted Yu
bq. Whether sContext(SQlCOntext) will help to query in both the dataframes and will it decide on which dataframe to query for . Can you clarify what you were asking ? The queries would be carried out on respective DataFrame's as shown in your snippet. On Thu, Feb 11, 2016 at 8:47 AM, Gaurav Agar

Re: Dataframes

2016-02-11 Thread Gaurav Agarwal
Thanks. That's what I tried to do, but for these two dataframes there is only one sqlContext. DataFrame tableA = sqlContext.read().jdbc(url,"tableA",prop); DataFrame tableB = sqlContext.read().jdbc(url,"tableB",prop); When I say something like this SqlContext sContext = new SQlContext(sc) DataFrame d
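
A minimal sketch of the point: one SQLContext can hold both DataFrames, and the SQL text decides which table is read, so nothing has to choose between them. The join columns are placeholders; url and prop are the values from the snippet above:

    val tableA = sqlContext.read.jdbc(url, "tableA", prop)
    val tableB = sqlContext.read.jdbc(url, "tableB", prop)

    tableA.registerTempTable("tableA")
    tableB.registerTempTable("tableB")

    val joined = sqlContext.sql(
      "SELECT a.id, b.name FROM tableA a JOIN tableB b ON a.id = b.id")   // placeholder columns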

RE: Spark execuotr Memory profiling

2016-02-11 Thread ARUN.BONGALE
Hi All, Even I have the same issues. The EMR conf is a 3 node cluster with m3.2xlarge. I'm trying to read a 100 GB file in spark-sql. I have set the below on spark export SPARK_EXECUTOR_MEMORY=4G export SPARK_DRIVER_MEMORY=12G export SPARK_EXECUTOR_INSTANCES=16 export SPARK_EXECUTOR_CORES=16 spark.kryoseriali

RE: Spark Streaming with Kafka: Dealing with 'slow' partitions

2016-02-11 Thread Diwakar Dhanuskodi
Hi, Did you try another implementation of DirectStream where you give only the topic? It would read all topic partitions in parallel under a batch interval. You need not create the union explicitly. Sent from Samsung Mobile. Original message From: p pathiyil Date:11

Re: Spark Streaming with Kafka: Dealing with 'slow' partitions

2016-02-11 Thread Cody Koeninger
The real way to fix this is by changing partitioning, so you don't have a hot partition. It would be better to do this at the time you're producing messages, but you can also do it with a shuffle / repartition during consuming. There is a setting to allow another batch to start in parallel, but t
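
A hedged sketch of the shuffle/repartition option mentioned above; the partition count is an arbitrary example and processRecord is a placeholder for the per-message work:

    // Spread the hot partition's records across more tasks at the cost of a shuffle.
    val rebalanced = directStream.repartition(12)
    rebalanced.foreachRDD { rdd =>
      rdd.foreachPartition { records =>
        records.foreach(processRecord)
      }
    }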

Re: Spark Certification

2016-02-11 Thread Janardhan Karri
I am planning to do that with databricks http://go.databricks.com/spark-certified-developer Regards, Janardhan On Thu, Feb 11, 2016 at 2:00 PM, Timothy Spann wrote: > I was wondering that as well. > > Also is it fully updated for 1.6? > > Tim > http://airisdata.com/ > http://sparkdeveloper.com/

PySpark : couldn't pickle object of type class T

2016-02-11 Thread Anoop Shiralige
Hi All, I am working with Spark 1.6.0 and the pySpark shell specifically. I have a JavaRDD[org.apache.avro.GenericRecord] which I have converted to a pythonRDD in the following way. javaRDD = sc._jvm.java.package.loadJson("path to data", sc._jsc) javaPython = sc._jvm.SerDe.javaToPython(javaRDD) from

Scala types to StructType

2016-02-11 Thread Fabian Böhnlein
Hi all, is there a way to create a Spark SQL Row schema based on Scala data types without creating a manual mapping? That's the only example I can find which doesn't require spark.sql.types.DataType already as input, but it requires defining them as Strings. * val struct = (new StructType
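
One possible route in Spark 1.6, sketched under the assumption that the data maps onto a case class: derive the StructType from the Dataset encoder instead of listing the fields by hand (the case class below is made up):

    import org.apache.spark.sql.Encoders
    import org.apache.spark.sql.types.StructType

    case class Reading(sensorId: String, timestamp: Long, value: Double)

    // Schema inferred from the Scala types, no string-typed field list needed.
    val schema: StructType = Encoders.product[Reading].schema
    schema.printTreeString()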

RE: Dataframes

2016-02-11 Thread Prashant Verma
Hi Gaurav, You can try something like this. SparkConf conf = new SparkConf(); JavaSparkContext sc = new JavaSparkContext(conf); SQLContext sqlContext = new org.apache.spark.sql.SQLContext(sc); Class.forName("com.mysql.jdbc.Driver"); String url="url"; Properties prop = new java.util

Dataframes

2016-02-11 Thread Gaurav Agarwal
Hi Can we load 5 data frames for 5 tables in one spark context? I am asking because we have to give Map options = new HashMap(); Options.put(driver,""); Options.put(URL,""); Options.put(dbtable,""); I can give only one table query at a time in the dbtable option. How will I register multiple queries a

Re: Spark Certification

2016-02-11 Thread Timothy Spann
I was wondering that as well. Also is it fully updated for 1.6? Tim http://airisdata.com/ http://sparkdeveloper.com/ From: naga sharathrayapati mailto:sharathrayap...@gmail.com>> Date: Wednesday, February 10, 2016 at 11:36 PM To: "user@spark.apache.org" mailto:us

Re: spark shell ini file

2016-02-11 Thread Ted Yu
Please see: [SPARK-13086][SHELL] Use the Scala REPL settings, to enable things like `-i file` On Thu, Feb 11, 2016 at 1:45 AM, Mich Talebzadeh < mich.talebza...@cloudtechnologypartners.co.uk> wrote: > Hi, > > > > in Hive one can use -I parameter to preload certain setting into the > beeline etc.
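
Once that change is available, the idea (sketched here with made-up contents) is to keep the shared setup in a Scala file and start the shell with spark-shell --master spark://host:7077 -i init.scala:

    // init.scala -- hypothetical preload file; the settings are examples, not recommendations
    sc.setLogLevel("WARN")
    sqlContext.setConf("spark.sql.shuffle.partitions", "400")
    val baseDir = "hdfs:///user/hive/warehouse"   // shared constant available in the session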

Spark Streaming with Kafka: Dealing with 'slow' partitions

2016-02-11 Thread p pathiyil
Hi, I am looking at a way to isolate the processing of messages from each Kafka partition within the same driver. Scenario: A DStream is created with the createDirectStream call by passing in a few partitions. Let us say that the streaming context is defined to have a time duration of 2 seconds.

Re: Inserting column to DataFrame

2016-02-11 Thread Michał Zieliński
I think a good idea would be to do a join: outputDF = unlabelledDF.join(predictedDF.select(“id”,”predicted”),”id”) On 11 February 2016 at 10:12, Zsolt Tóth wrote: > Hi, > > I'd like to append a column of a dataframe to another DF (using Spark > 1.5.2): > > DataFrame outputDF = unlabelledDF.with

Inserting column to DataFrame

2016-02-11 Thread Zsolt Tóth
Hi, I'd like to append a column of a dataframe to another DF (using Spark 1.5.2): DataFrame outputDF = unlabelledDF.withColumn("predicted_label", predictedDF.col("predicted")); I get the following exception: java.lang.IllegalArgumentException: requirement failed: DataFrame must have the same sc

RE: Getting prediction values in spark mllib

2016-02-11 Thread Chandan Verma
Thanks it got solved ☺ From: Artem Aliev [mailto:artem.al...@gmail.com] Sent: Thursday, February 11, 2016 3:19 PM To: Sonal Goyal Cc: Chandan Verma; user@spark.apache.org Subject: Re: Getting prediction values in spark mllib It depends on Algorithm you use NaiveBayesModel has predictProbabilities

Re: Getting prediction values in spark mllib

2016-02-11 Thread Artem Aliev
It depends on the algorithm you use. NaiveBayesModel has a predictProbabilities method to work directly with probabilities; for LogisticRegressionModel and SVMModel, clearThreshold() will make the predict method return probabilities, as mentioned above. On Thu, Feb 11, 2016 at 11:17 AM, Sonal Goyal wrote: > Lo
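
A small MLlib sketch of what clearing the threshold does; the two training points are synthetic:

    import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.regression.LabeledPoint

    val data = sc.parallelize(Seq(
      LabeledPoint(0.0, Vectors.dense(0.0, 1.0)),
      LabeledPoint(1.0, Vectors.dense(1.0, 0.0))))

    val model = new LogisticRegressionWithLBFGS().setNumClasses(2).run(data)

    model.predict(Vectors.dense(1.0, 0.0))   // with the default threshold: returns 0.0 or 1.0
    model.clearThreshold()
    model.predict(Vectors.dense(1.0, 0.0))   // now returns the raw probability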

spark shell ini file

2016-02-11 Thread Mich Talebzadeh
Hi, in Hive one can use the -i parameter to preload certain settings into beeline etc. Is there an equivalent parameter for spark-shell as well? For example spark-shell --master spark://50.140.197.217:7077 can I pass a parameter file? Thanks -- Mich Talebzadeh LinkedIn https://www.l

Re: retrieving all the rows with collect()

2016-02-11 Thread Mich Talebzadeh
Thanks Jacob much appreciated Mich On 11/02/2016 00:01, Jakob Odersky wrote: > Exactly! > As a final note, `foreach` is also defined on RDDs. This means that > you don't need to `collect()` the results into an array (which could > give you an OutOfMemoryError in case the RDD is really real

[OT] Apache Spark Jobs in Kochi, India

2016-02-11 Thread Andrew Holway
Hello, I'm not sure how appropriate job postings are to a user group. We're getting deep into spark and are looking for some talent in our Kochi office. http://bit.ly/Spark-Eng - Apache Spark Engineer / Architect - Kochi http://bit.ly/Spark-Dev - Lead Apache Spark Developer - Kochi Sorry for th

Re: Getting prediction values in spark mllib

2016-02-11 Thread Sonal Goyal
Looks like you are doing binary classification and you are getting the label out. If you clear the model threshold, you should be able to get the raw score. On Feb 11, 2016 1:32 PM, "Chandan Verma" wrote: > > > Following is the code Snippet > > > > > > JavaRDD> predictionAndLabels = data > >

Getting prediction values in spark mllib

2016-02-11 Thread Chandan Verma
Following is the code snippet JavaRDD<Tuple2<Object, Object>> predictionAndLabels = data .map(new Function<LabeledPoint, Tuple2<Object, Object>>() { public Tuple2<Object, Object> call(LabeledPoint p) { Double prediction = sameModel.predict(p.features()); return new Tuple2<Object, Object>(prediction, p.labe