Linear Regression Error

2016-10-12 Thread Meeraj Kunnumpurath
Hello, I have some code trying to compare linear regression coefficients with three sets of features, as shown below. On the third one, I get an assertion error. This is the code: object MultipleRegression extends App { val spark = SparkSession.builder().appName("Regression Model
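
A minimal sketch of that kind of comparison in Spark ML (the column names, data path, and feature sets are hypothetical stand-ins for the original code):

    import org.apache.spark.ml.feature.VectorAssembler
    import org.apache.spark.ml.regression.LinearRegression
    import org.apache.spark.sql.SparkSession

    object MultipleRegressionSketch extends App {
      val spark = SparkSession.builder().appName("Regression Model").master("local[*]").getOrCreate()

      // Hypothetical input: a CSV file with numeric feature columns and a label column.
      val df = spark.read.option("header", "true").option("inferSchema", "true").csv("houses.csv")

      // Fit one model per feature set and print its coefficients for comparison.
      Seq(Array("sqft_living"), Array("sqft_living", "bedrooms"), Array("sqft_living", "bedrooms", "bathrooms"))
        .foreach { features =>
          val assembler = new VectorAssembler().setInputCols(features).setOutputCol("features")
          val lr = new LinearRegression().setLabelCol("price").setFeaturesCol("features")
          val model = lr.fit(assembler.transform(df))
          println(s"${features.mkString(",")} -> ${model.coefficients}")
        }
      spark.stop()
    }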

Spyder and SPARK combination problem...Please help!

2016-10-12 Thread innocent73
Hi folks, I am a complete beginner to Spark and the Python environment. I have installed Spark and would like to run a code snippet named "SpyderSetupForSpark.py": # -*- coding: utf-8 -*- """ Make sure you give execute privileges

Re: Reading from and writing to different S3 buckets in spark

2016-10-12 Thread Mridul Muralidharan
If using RDDs, you can use saveAsHadoopFile or saveAsNewAPIHadoopFile with a conf passed in that overrides the keys you need. For example, you can do: val saveConf = new Configuration(sc.hadoopConfiguration) // configure saveConf with overridden s3 config rdd.saveAsNewAPIHadoopFile(..., conf
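
A rough sketch of that approach (bucket name and keys are placeholders; sc and rdd are assumed to be in scope):

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.io.{NullWritable, Text}
    import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat

    // Clone the context's Hadoop configuration so the global settings stay untouched.
    val saveConf = new Configuration(sc.hadoopConfiguration)
    saveConf.set("fs.s3n.awsAccessKeyId", "DESTINATION_ACCESS_KEY")
    saveConf.set("fs.s3n.awsSecretAccessKey", "DESTINATION_SECRET_KEY")

    // rdd: RDD[String], turned into a pair RDD so it can be written with the new Hadoop API.
    rdd.map(line => (NullWritable.get(), new Text(line)))
      .saveAsNewAPIHadoopFile(
        "s3n://destination-bucket/output",
        classOf[NullWritable],
        classOf[Text],
        classOf[TextOutputFormat[NullWritable, Text]],
        saveConf)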

Matrix Operations

2016-10-12 Thread Meeraj Kunnumpurath
Hello, Does anyone have examples of doing matrix operations (multiplication, transpose, inverse, etc.) using the Spark ML API? Many thanks -- *Meeraj Kunnumpurath* *Director and Executive Principal, Service Symphony Ltd* *00 44 7702 693597* *00 971 50 409 0169* mee...@servicesymphony.com
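
The DataFrame-based spark.ml API does not include distributed matrices, but the spark.mllib distributed types cover multiplication and transpose. A small sketch (the data is made up, sc is an assumed SparkContext); as far as I know there is no built-in distributed inverse, which generally means pulling the matrix into a local library such as Breeze:

    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.linalg.distributed.{IndexedRow, IndexedRowMatrix}

    // A tiny 2x2 matrix as an IndexedRowMatrix.
    val rows = sc.parallelize(Seq(
      IndexedRow(0, Vectors.dense(1.0, 2.0)),
      IndexedRow(1, Vectors.dense(3.0, 4.0))))
    val matA = new IndexedRowMatrix(rows)

    // BlockMatrix supports distributed multiply and transpose.
    val blockA = matA.toBlockMatrix()
    val product = blockA.multiply(blockA.transpose)
    product.toLocalMatrix()   // collect the small result locally to inspect it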

Memory leak warnings in Spark 2.0.1

2016-10-12 Thread vonnagy
I am getting excessive memory leak warnings when running multiple mappings and aggregations on Datasets. Is there anything I should be looking for to resolve this, or is this a known issue? WARN [Executor task launch worker-0] org.apache.spark.memory.TaskMemoryManager - leak 16.3 MB memory

Re: Spyder and SPARK combination problem...Please help!

2016-10-12 Thread neil90
Are you using Windows? Switching over to a Linux environment made that error go away for me.

Re: Map with state keys serialization

2016-10-12 Thread Shixiong(Ryan) Zhu
Oh, OpenHashMapBasedStateMap is serialized using Kryo's "com.esotericsoftware.kryo.serializers.JavaSerializer". Did you set it for OpenHashMapBasedStateMap? You don't need to set anything for Spark's classes in 1.6.2. On Wed, Oct 12, 2016 at 7:11 AM, Joey Echeverria wrote: > I

Re: Spyder and SPARK combination problem...Please help!

2016-10-12 Thread innocent73
Yes Neil, I use Windows 8.1.

Re: Map with state keys serialization

2016-10-12 Thread Joey Echeverria
That fixed it! I still had the serializer registered as a workaround for SPARK-12591. Thanks so much for your help, Ryan! -Joey On Wed, Oct 12, 2016 at 2:16 PM, Shixiong(Ryan) Zhu wrote: > Oh, OpenHashMapBasedStateMap is serialized using Kryo's >
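
For reference, a leftover registration of that kind typically looks something like the sketch below (the internal class is looked up by name here since it is not public, and the package path is assumed; on 1.6.2 this registration is the thing to delete rather than add):

    import com.esotericsoftware.kryo.Kryo
    import com.esotericsoftware.kryo.serializers.JavaSerializer
    import org.apache.spark.serializer.KryoRegistrator

    class StateMapRegistrator extends KryoRegistrator {
      override def registerClasses(kryo: Kryo): Unit = {
        // Hypothetical workaround registration for SPARK-12591.
        kryo.register(
          Class.forName("org.apache.spark.streaming.util.OpenHashMapBasedStateMap"),
          new JavaSerializer())
      }
    }
    // Wired up via: sparkConf.set("spark.kryo.registrator", "StateMapRegistrator")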

Spark Shuffle Issue

2016-10-12 Thread Ankur Srivastava
Hi, I am upgrading my jobs to Spark 1.6 and am running into shuffle issues. I have tried all the options and am now falling back to the legacy memory model, but I am still running into the same issue. I have set spark.shuffle.blockTransferService to nio. 16/10/12 06:00:10 INFO MapOutputTrackerMaster: Size of

Re: JSON Arrays and Spark

2016-10-12 Thread sujeet jog
I generally use the Play Framework APIs for complex JSON structures. https://www.playframework.com/documentation/2.5.x/ScalaJson#Json On Wed, Oct 12, 2016 at 11:34 AM, Kappaganthu, Sivaram (ES) < sivaram.kappagan...@adp.com> wrote: > Hi, > > > > Does this mean that handling any Json with kind of

RE: JSON Arrays and Spark

2016-10-12 Thread Kappaganthu, Sivaram (ES)
Hi, Does this mean that handling any JSON with the kind of schema below is not a good fit for Spark? I have a requirement to parse JSON like the one below, which spans multiple lines. What's the best way to parse JSON of this kind? Please suggest. root |-- maindate: struct (nullable = true) |

Re: JSON Arrays and Spark

2016-10-12 Thread Hyukjin Kwon
No, I meant each record should be on a single line, but an array type is also supported as the root wrapper of JSON objects. If you need to parse multiple lines, I have a reference here: http://searchdatascience.com/spark-adventures-1-processing-multi-line-json-files/ 2016-10-12 15:04 GMT+09:00 Kappaganthu,
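
If each file holds one JSON document that spans multiple lines, a common approach (sketched below with a placeholder path) is to read whole files as single records and hand them to the JSON reader:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("multiline-json").master("local[*]").getOrCreate()

    // wholeTextFiles yields one (path, content) pair per file, so a document
    // spread over several lines stays in one piece.
    val raw = spark.sparkContext.wholeTextFiles("/path/to/json-dir").values
    val df = spark.read.json(raw)
    df.printSchema()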

Re: Reading from and writing to different S3 buckets in spark

2016-10-12 Thread Steve Loughran
On 12 Oct 2016, at 10:49, Aseem Bansal wrote: Hi I want to read CSV from one bucket, do some processing and write to a different bucket. I know the way to set S3 credentials using jssc.hadoopConfiguration().set("fs.s3n.awsAccessKeyId",

Kafka integration: get existing Kafka messages?

2016-10-12 Thread Haopu Wang
Hi, I want to read the existing Kafka messages and then subscribe to new stream messages. But I find the "auto.offset.reset" property is always set to "none" in KafkaUtils. Does that mean I cannot specify the "earliest" value when creating a direct stream? Thank you!

Reading from and writing to different S3 buckets in spark

2016-10-12 Thread Aseem Bansal
Hi I want to read CSV from one bucket, do some processing and write to a different bucket. I know the way to set S3 credentials using jssc.hadoopConfiguration().set("fs.s3n.awsAccessKeyId", YOUR_ACCESS_KEY) jssc.hadoopConfiguration().set("fs.s3n.awsSecretAccessKey", YOUR_SECRET_KEY) But the

Re: mllib model in production web API

2016-10-12 Thread Aseem Bansal
Hi, Faced a similar issue. Our solution was to load the model, cache it after converting it to an mllib model, and then use that instead of the ml model. On Tue, Oct 11, 2016 at 10:22 PM, Sean Owen wrote: > I don't believe it will ever scale to spin up a whole distributed job
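
A rough sketch of that pattern, assuming a logistic regression model saved in the mllib format (the path and features are placeholders, sc is an assumed SparkContext): load the model once at application startup and score single vectors per request, which is a plain local call rather than a Spark job:

    import org.apache.spark.mllib.classification.LogisticRegressionModel
    import org.apache.spark.mllib.linalg.Vectors

    // Loaded once and kept in memory for the lifetime of the web application.
    val model = LogisticRegressionModel.load(sc, "hdfs:///models/lr-model")

    // Per-request scoring.
    def score(features: Array[Double]): Double = model.predict(Vectors.dense(features))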

Re: Spark ML OOM problem

2016-10-12 Thread Jörn Franke
Which Spark version? Are you using RDDs or Datasets? What type are the features? If strings, how large? Is it Spark standalone? How do you train/configure the algorithm? How do you initially parse the data? The standard driver and executor logs could be helpful. > On 12 Oct 2016, at 09:24, 陈哲

Spark-Sql 2.0 nullpointerException

2016-10-12 Thread Selvam Raman
Hi, I am reading a parquet file and creating a temp table. When I try to execute the query outside of the foreach function, it works fine, but it throws a NullPointerException within the DataFrame.foreach function. Code snippet: String CITATION_QUERY = "select c.citation_num, c.title, c.publisher from test

Re: Spark-Sql 2.0 nullpointerException

2016-10-12 Thread Selvam Raman
What I am trying to achieve: trigger a query to get numbers (i.e., 1, 2, 3, ..., n), and for every number I have to trigger another 3 queries. Thanks, Selvam R On Wed, Oct 12, 2016 at 4:10 PM, Selvam Raman wrote: > Hi , > > I am reading parquet file and creating temp table. when i am
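
The usual cause of this NullPointerException is using the SparkSession (or another DataFrame) inside DataFrame.foreach, which runs on the executors. A hedged sketch of the driver-side alternative, with table and column names following the snippet above but otherwise hypothetical:

    // Runs on the driver; `spark` is the SparkSession and `test` is the registered temp table.
    val citations = spark.sql("select c.citation_num, c.title, c.publisher from test c")

    // collect() brings the keys back to the driver, so the follow-up queries
    // are issued from the driver instead of from inside an executor-side foreach.
    citations.select("citation_num").collect().foreach { row =>
      val num = row.get(0)
      spark.sql(s"select * from test where citation_num = $num").show()
      // ... the other follow-up queries would go here ...
    }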

Re: Map with state keys serialization

2016-10-12 Thread Joey Echeverria
I tried with 1.6.2 and saw the same behavior. -Joey On Tue, Oct 11, 2016 at 5:18 PM, Shixiong(Ryan) Zhu wrote: > There are some known issues in 1.6.0, e.g., > https://issues.apache.org/jira/browse/SPARK-12591 > > Could you try 1.6.1? > > On Tue, Oct 11, 2016 at 9:55 AM,

Spark 2.0 Encoder().schema() is sorting StructFields

2016-10-12 Thread Paul Stewart
Hi all, I am using Spark 2.0 to read a CSV file into a Dataset in Java. This works fine if I define the StructType with the StructField array ordered by hand. What I would like to do is use a bean class for both the schema and Dataset row type. For example, Dataset beanDS =
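
A hedged sketch of one workaround (the case class stands in for the Java bean, and the column names are made up): declare the file schema explicitly in the CSV's physical column order, and let the encoder match columns to fields by name so its own field ordering no longer matters:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.types.{DoubleType, StringType, StructField, StructType}

    case class House(zip: String, price: Double)

    val spark = SparkSession.builder().appName("bean-schema").master("local[*]").getOrCreate()
    import spark.implicits._

    // Explicit schema in the order the columns actually appear in the file.
    val fileSchema = StructType(Seq(
      StructField("zip", StringType),
      StructField("price", DoubleType)))

    // .as[House] resolves columns by name, not by position.
    val houseDS = spark.read.schema(fileSchema).option("header", "true").csv("houses.csv").as[House]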

Re: What happens when an executor crashes?

2016-10-12 Thread Cody Koeninger
Yes, partitionBy will shuffle unless it happens to be partitioning with the exact same partitioner the parent rdd had. On Wed, Oct 12, 2016 at 8:34 AM, Samy Dindane wrote: > Hey Cody, > > I ended up choosing a different way to do things, which is using Kafka to > commit my

Re: Limit Kafka batches size with Spark Streaming

2016-10-12 Thread Cody Koeninger
How would backpressure know anything about the capacity of your system on the very first batch? You should be able to set maxRatePerPartition at a value that makes sure your first batch doesn't blow things up, and let backpressure scale from there. On Wed, Oct 12, 2016 at 8:53 AM, Samy Dindane
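
A minimal sketch of that configuration (the rate value is a placeholder to size against your own first batch):

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .setAppName("kafka-stream")
      // Cap the very first batches, since backpressure has no history to work from yet.
      .set("spark.streaming.kafka.maxRatePerPartition", "1000") // records per second per partition
      // Let the backpressure controller adjust the rate from there.
      .set("spark.streaming.backpressure.enabled", "true")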

Re: What happens when an executor crashes?

2016-10-12 Thread Samy Dindane
Hey Cody, I ended up choosing a different way to do things, which is using Kafka to commit my offsets. It works fine, except it stores the offset in ZK instead of a Kafka topic (investigating this right now). I understand your explanations, thank you, but I have one question: when you say

Re: Kafka integration: get existing Kafka messages?

2016-10-12 Thread Cody Koeninger
It's set to none for the executors, because otherwise they won't do exactly what the driver told them to do. You should be able to set up the driver consumer to determine batches however you want, though. On Wednesday, October 12, 2016, Haopu Wang wrote: > Hi, > > > > I want
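
A hedged sketch for the 0.10 direct stream (brokers, group id, and topic are placeholders; ssc is an assumed StreamingContext): auto.offset.reset in the kafkaParams applies to the driver consumer, so "earliest" starts from the existing messages when the group has no committed offsets:

    import org.apache.kafka.common.serialization.StringDeserializer
    import org.apache.spark.streaming.kafka010.KafkaUtils
    import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
    import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "broker1:9092",
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "my-consumer-group",
      // Used by the driver consumer; Spark overrides it to "none" on the executors.
      "auto.offset.reset" -> "earliest")

    val stream = KafkaUtils.createDirectStream[String, String](
      ssc, PreferConsistent, Subscribe[String, String](Seq("my-topic"), kafkaParams))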

Re: Limit Kafka batches size with Spark Streaming

2016-10-12 Thread Samy Dindane
That's what I was looking for, thank you. Unfortunately, neither * spark.streaming.backpressure.initialRate * spark.streaming.backpressure.enabled * spark.streaming.receiver.maxRate * spark.streaming.receiver.initialRate change how many records I get (I tried many different combinations). The

Re: Linear Regression Error

2016-10-12 Thread Meeraj Kunnumpurath
If I drop the last feature on the third model, the error seems to go away. On Wed, Oct 12, 2016 at 11:52 PM, Meeraj Kunnumpurath < mee...@servicesymphony.com> wrote: > Hello, > > I have some code trying to compare linear regression coefficients with > three sets of features, as shown below. On

Re: Linear Regression Error

2016-10-12 Thread Sean Owen
See https://issues.apache.org/jira/browse/SPARK-17588 On Wed, Oct 12, 2016 at 9:07 PM Meeraj Kunnumpurath < mee...@servicesymphony.com> wrote: > If I drop the last feature on the third model, the error seems to go away. > > On Wed, Oct 12, 2016 at 11:52 PM, Meeraj Kunnumpurath < >

Spark ML OOM problem

2016-10-12 Thread 陈哲
Hi, I'm using Spark ML to train a RandomForest model. There are about 200,000 lines in the training data file and about 100 features. I'm running Spark in local mode with JAVA_OPTS like: -Xms1024m -Xmx10296m -XX:+PrintGCDetails -XX:+PrintGCTimeStamps, but OOM errors keep coming out, I

Re: Spark Shuffle Issue

2016-10-12 Thread Ankur Srivastava
Hi, I was able to resolve the issue by increasing the timeout, reducing the number of executors, and increasing the number of cores per executor. The issue is resolved, but I am still not sure why reducing the number of executors and increasing the number of cores per executor fixed issues related to

Re: Limit Kafka batches size with Spark Streaming

2016-10-12 Thread Cody Koeninger
Cool, just wanted to make sure. To answer your question about > Isn't "spark.streaming.backpressure.initialRate" supposed to do this? that configuration was added well after the integration of the direct stream with the backpressure code, and was added only to the receiver code, which the

UDF on multiple columns

2016-10-12 Thread Meeraj Kunnumpurath
Hello, How do I write a UDF that operates on two columns? For example, how do I introduce a new column that is the product of two columns already on the dataframe? Many thanks, Meeraj

Re: UDF on multiple columns

2016-10-12 Thread Meeraj Kunnumpurath
This is what I do at the moment, def build(path: String, spark: SparkSession) = { val toDouble = udf((x: String) => x.toDouble) val df = spark.read. option("header", "true"). csv(path). withColumn("sqft_living", toDouble('sqft_living)). withColumn("price", toDouble('price)).
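
For the two-column case, a minimal sketch extending the pattern above (the output name is illustrative, and df stands for a dataframe like the one built above):

    import org.apache.spark.sql.functions.{col, udf}

    // A UDF over two columns: multiply them into a new column.
    val product = udf((a: Double, b: Double) => a * b)
    val withProduct = df.withColumn("product", product(col("sqft_living"), col("price")))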

Re: Limit Kafka batches size with Spark Streaming

2016-10-12 Thread Samy Dindane
I am 100% sure. println(conf.get("spark.streaming.backpressure.enabled")) prints true. On 10/12/2016 05:48 PM, Cody Koeninger wrote: Just to make 100% sure, did you set spark.streaming.backpressure.enabled to true? On Wed, Oct 12, 2016 at 10:09 AM, Samy Dindane wrote:

How to prevent having more than one instance of a specific job running on the cluster

2016-10-12 Thread Samy Dindane
Hi, I'd like a specific job to fail if there's another instance of it already running on the cluster (Spark Standalone in my case). How to achieve this? Thank you. - To unsubscribe e-mail: user-unsubscr...@spark.apache.org

DataFrame/Dataset join not producing correct results in Spark 2.0/Yarn

2016-10-12 Thread Stephen Hankinson
Hi, We have a cluster running Apache Spark 2.0 on Hadoop 2.7.2, Centos 7.2. We had written some new code using the Spark DataFrame/DataSet APIs but are noticing incorrect results on a join after writing and then reading data to Windows Azure Storage Blobs (The default HDFS location). I've been

RE: Kafka integration: get existing Kafka messages?

2016-10-12 Thread Haopu Wang
Cody, thanks for the response. So the Kafka direct stream actually has consumers on both the driver and the executors? Can you please provide more details? Thank you very much! From: Cody Koeninger [mailto:c...@koeninger.org] Sent: 12 October 2016 20:10 To: Haopu Wang

Re: Kafka integration: get existing Kafka messages?

2016-10-12 Thread Cody Koeninger
Look at the presentation and blog post linked from https://github.com/koeninger/kafka-exactly-once They refer to the kafka 0.8 version of the direct stream but the basic idea is the same On Wed, Oct 12, 2016 at 7:35 PM, Haopu Wang wrote: > Cody, thanks for the response. >

Unsubscribe

2016-10-12 Thread R. Revert
Unsubscribe. On 12 Oct. 2016, 11:26 p.m., "Reynold Xin" wrote: I took a look at all the public APIs we expose in o.a.spark.sql tonight, and realized we still have a large number of APIs that are marked experimental. Most of these haven't really changed, except in 2.0 we

Mark DataFrame/Dataset APIs stable

2016-10-12 Thread Reynold Xin
I took a look at all the public APIs we expose in o.a.spark.sql tonight, and realized we still have a large number of APIs that are marked experimental. Most of these haven't really changed, except in 2.0 we merged DataFrame and Dataset. I think it's long overdue to mark them stable. I'm tracking

download spark 1.2.1

2016-10-12 Thread Irfan Sayyed
Dears, How can I download Spark 1.2.1? I tried the Spark site, but version 1.2.1 is not available there. I need Spark 1.2.1 for the CCA-175 exam. -- *Regards,* *Irfan Sayyed*