How to make this Spark 1.5.2 code fast and shuffle less data

2015-12-10 Thread unk1102
Hi, I have a Spark job which reads Hive-ORC data, processes it, and generates a CSV file at the end. These ORC files are Hive partitions, and I have around 2000 partitions to process every day. These Hive partitions total around 800 GB in HDFS. I have the following method code which I call from a

Spark 1.3.1 - Does SparkContext in multi-threaded env require SparkEnv.set(env) anymore

2015-12-10 Thread Nirav Patel
As the subject says, do we still need to use a static env in every thread that accesses sparkContext? I read some references here: http://qnalist.com/questions/4956211/is-spark-context-in-local-mode-thread-safe

Re: How to make this Spark 1.5.2 code fast and shuffle less data

2015-12-10 Thread Benyi Wang
DataFrame filterFrame1 = sourceFrame.filter(col("col1").contains("xyz"));
DataFrame frameToProcess = sourceFrame.except(filterFrame1);

except is really expensive. Do you actually want this:

sourceFrame.filter(!col("col1").contains("xyz"))

On Thu, Dec 10, 2015 at 9:57 AM, unk1102
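A minimal sketch of that suggestion in the Scala DataFrame API (assuming a DataFrame named sourceFrame with a string column col1, Spark 1.5):

    import org.apache.spark.sql.functions.col

    // One pass over the data: keep only rows whose col1 does NOT contain "xyz".
    // This avoids except(), which compares the two DataFrames and shuffles both.
    val frameToProcess = sourceFrame.filter(!col("col1").contains("xyz"))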

Re: Spark streaming with Kinesis broken?

2015-12-10 Thread Brian London
Nick's symptoms sound identical to mine. I should mention that I just pulled the latest version from GitHub and it seems to be working there. To reproduce:
1. Download Spark 1.5.2 from http://spark.apache.org/downloads.html
2. build/mvn -Pyarn -Phadoop-2.6 -Dhadoop.version=2.6.0

Re: HELP! I get "java.lang.String cannot be cast to java.lang.Integer" for a long time.

2015-12-10 Thread Jakob Odersky
Could you provide some more context? What is rawData? On 10 December 2015 at 06:38, Bonsen wrote: > I do like this "val secondData = rawData.flatMap(_.split("\t").take(3))" > > and I find: > 15/12/10 22:36:55 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, >

Re: Spark streaming with Kinesis broken?

2015-12-10 Thread Jean-Baptiste Onofré
Hi Nick, Just to be sure: don't you see some ClassCastException in the log ? Thanks, Regards JB On 12/10/2015 07:56 PM, Nick Pentreath wrote: Could you provide an example / test case and more detail on what issue you're facing? I've just tested a simple program reading from a dev Kinesis

Re: How to make this Spark 1.5.2 code fast and shuffle less data

2015-12-10 Thread Benyi Wang
I don't understand this: "I have the following method code which I call it from a thread spawn from spark driver. So in this case 2000 threads ..." Why do you call it from a thread? Are you processing one partition in one thread? On Thu, Dec 10, 2015 at 11:13 AM, Benyi Wang

Re: How to make this Spark 1.5.2 code fast and shuffle less data

2015-12-10 Thread Umesh Kacha
Hi Benyi, thanks for the reply. Yes, I process each Hive partition/HDFS directory in one thread so that I can make it faster; if I don't use threads the job is even slower. Like I mentioned, I have to process 2000 Hive partitions, so 2000 HDFS directories containing ORC files, right? If I don't use

[mesos][docker] addFile doesn't work properly

2015-12-10 Thread PHELIPOT, REMY
Hello! I'm using Apache Spark with Mesos, and I've launched a job with coarse-mode=true. In my job, I must download a file from the internet, so I'm using:

import org.apache.spark.SparkFiles
sc.addFile("http://samplecsvs.s3.amazonaws.com/Sacramentorealestatetransactions.csv")
val path =

Replaying an RDD in spark streaming to update an accumulator

2015-12-10 Thread AliGouta
I am actually running out of options. In my Spark Streaming application, I want to keep state for some keys. I am getting events from Kafka. I then extract keys from the event, say userID. When there are no events coming from Kafka, I want to keep updating a counter relative to each user ID every 3

Re: Rule Engine for Spark

2015-12-10 Thread Luciano Resende
Adrian, Do you have an example that would show how to change the filters at runtime by going RDD -> DF -> SQL -> RDD? Non-working pseudo code would work. Thank you On Wed, Nov 4, 2015 at 3:02 AM, Adrian Tanase wrote: > Another way to do it is to extract your filters as

Re: Unable to acces hive table (created through hive context) in hive console

2015-12-10 Thread Harsh J
Are you certain you are providing Spark with the right Hive configuration? Is there a valid HIVE_CONF_DIR defined in your spark-env.sh, with a hive-site.xml detailing the location/etc. of the metastore service and/or DB? Without a valid metastore config, Hive may switch to using a local
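As a rough sketch of a quick check from the Spark shell (the thrift URI below is a placeholder; normally this comes from hive-site.xml on the driver's classpath):

    import org.apache.spark.sql.hive.HiveContext

    val hiveContext = new HiveContext(sc)
    // Placeholder URI -- this should match the metastore service configured in
    // hive-site.xml; without it Hive silently falls back to a local Derby
    // metastore and tables created elsewhere are not visible.
    hiveContext.setConf("hive.metastore.uris", "thrift://metastore-host:9083")
    hiveContext.sql("SHOW TABLES").show()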

Re: Spark groupby and agg inconsistent and missing data

2015-12-10 Thread Kapil Raaj
Hi folks, I am also getting a similar issue:

(df.groupBy("email").agg(last("user_id") as "user_id").select("user_id").count,
 df.groupBy("email").agg(last("user_id") as "user_id").select("user_id").distinct.count)

When run on one computer it gives: (15123144,15123144) When run on a cluster it gives:

Re: INotifyDStream - where to find it?

2015-12-10 Thread Harsh J
I couldn't spot it anywhere on the web so it doesn't look to be contributed yet, but note that the HDFS APIs are already available per https://issues.apache.org/jira/browse/HDFS-6634 (you can see the test case for an implementation guideline in Java:

Re: About the bottleneck of parquet file reading in Spark

2015-12-10 Thread Cheng Lian
Cc Spark user list since this information is generally useful. On Thu, Dec 10, 2015 at 3:31 PM, Lionheart <87249...@qq.com> wrote: > Dear, Cheng > I'm a user of Spark. Our current Spark version is 1.4.1 > In our project, I find there is a bottleneck when loading huge amount > of

RE: FileNotFoundException in appcache shuffle files

2015-12-10 Thread kendal
I have similar issues... the exception occurs only with very large data. I tried to double the memory or partitions as suggested by some Google searches, but in vain. Any idea?

Re: SparkStreaming variable scope

2015-12-10 Thread Pinela
exactly :) thanks Harsh :D On Thu, Dec 10, 2015 at 3:18 AM, Harsh J wrote: > > and then calling getRowID() in the lambda, because the function gets > sent to the executor right? > > Yes, that is correct (vs. a one time evaluation, as was with your > assignment earlier). > >

Help: Get Timeout error and FileNotFoundException when shuffling large files

2015-12-10 Thread kendal
Hi there, my application is quite simple: it just reads huge files from HDFS with textFile(), then I map to tuples, after that a reduceByKey(), and finally saveAsTextFile(). The problem is when I am dealing with large inputs (2.5T): when the application enters the 2nd stage -- reduce by key. It

Re: Replaying an RDD in spark streaming to update an accumulator

2015-12-10 Thread Ali Gouta
Indeed, you are right! I felt like I was missing or misunderstanding something. Thank you so much! Ali Gouta. On Thu, Dec 10, 2015 at 10:04 PM, Cody Koeninger wrote: > I'm a little confused as to why you have fake events rather than just > doing foreachRDD or

Re: Spark streaming with Kinesis broken?

2015-12-10 Thread Brian London
Yes, it worked in the 1.6 branch as of commit db5165246f2888537dd0f3d4c5a515875c7358ed. That makes it much less serious of an issue, although it would be nice to know what the root cause is to avoid a regression. On Thu, Dec 10, 2015 at 4:03 PM Burak Yavuz wrote: > I've

Re: Spark 1.3.1 - Does SparkConext in multi-threaded env requires SparkEnv.set(env) anymore

2015-12-10 Thread Josh Rosen
Nope, you shouldn't have to do that anymore. As of https://github.com/apache/spark/pull/2624, which is in Spark 1.2.0+, SparkEnv's thread-local stuff was removed and replaced by a simple global variable (since it was used in an *effectively* global way before (see my comments on that PR)). As a

Re: DataFrame creation delay?

2015-12-10 Thread Isabelle Phan
Hi Michael, We have just upgraded to Spark 1.5.0 (actually 1.5.0_cdh-5.5 since we are on Cloudera), and Parquet formatted tables. I turned on spark.sql.hive.metastorePartitionPruning=true, but DataFrame creation still takes a long time. Is there any other configuration to consider? Thanks a

Re: Replaying an RDD in spark streaming to update an accumulator

2015-12-10 Thread Cody Koeninger
I'm a little confused as to why you have fake events rather than just doing foreachRDD or foreachPartition on your kafka stream and updating the accumulator there. I'd expect that to run each batch even if the batch had 0 kafka messages in it. On Thu, Dec 10, 2015 at 2:05 PM, AliGouta
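A hedged sketch of what that could look like (ssc, stream, and the key names are assumptions, not from the thread):

    // The accumulator lives on the driver; foreachRDD runs once per batch,
    // even when the batch contains zero Kafka messages.
    val eventCount = ssc.sparkContext.accumulator(0L, "eventCount")

    stream.foreachRDD { rdd =>
      rdd.foreach { case (userId, _) => eventCount += 1L }
      println(s"running total so far: ${eventCount.value}") // driver side
    }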

Re: Spark on EMR: out-of-the-box solution for real-time application logs monitoring?

2015-12-10 Thread Steve Loughran
> On 10 Dec 2015, at 14:52, Roberto Coluccio wrote: > > Hello, > > I'm investigating on a solution to real-time monitor Spark logs produced by > my EMR cluster in order to collect statistics and trigger alarms. Being on > EMR, I found the CloudWatch Logs + Lambda

Re: Workflow manager for Spark and Spark SQL

2015-12-10 Thread Ali Tajeldin EDU
Hi Alexander, We developed SMV to address the exact issue you mentioned. While it is not a workflow engine per se, it does allow for the creation of modules with dependencies and automates the execution of these modules. See

Re: Spark streaming with Kinesis broken?

2015-12-10 Thread Burak Yavuz
I've noticed this happening when there were some dependency conflicts, and it is super hard to debug. It seems that the KinesisClientLibrary version in Spark 1.5.2 is 1.3.0, but it is 1.2.1 in Spark 1.5.1. I feel like that seems to be the problem... Brian, did you verify that it works with the

Re: [mesos][docker] addFile doesn't work properly

2015-12-10 Thread PhuDuc Nguyen
Have you tried setting the spark.mesos.uris property? Like:

val conf = new SparkConf().set("spark.mesos.uris", ...)
val sc = new SparkContext(conf)
...

http://spark.apache.org/docs/latest/running-on-mesos.html HTH, Duc On Thu, Dec 10, 2015 at 1:04 PM, PHELIPOT, REMY

RE: Graph visualization tool for GraphX

2015-12-10 Thread Lin, Hao
Hi Andy, quick question: does Spark-Notebook include its own Spark engine, or do I need to install Spark separately and point to it from Spark Notebook? Thanks

Re: FileNotFoundException in appcache shuffle files

2015-12-10 Thread Jiří Syrový
Usually there is another error or log message before FileNotFoundException. Try to check your logs for something like that. 2015-12-10 10:47 GMT+01:00 kendal : > I have similar issues... Exception only with very large data. > And I tried to double the memory or partition as

Re: Can't filter

2015-12-10 Thread Harsh J
Are you sure you do not have any messages preceding the trace, such as one quoting which class is found to be missing? That'd be helpful to see and suggest what may (exactly) be going wrong. It appears similar to https://issues.apache.org/jira/browse/SPARK-8368, but I cannot tell for certain cause

Can't filter

2015-12-10 Thread Бобров Виктор
Hi, I can't filter my RDD.

def filter1(tp: ((Array[String], Int), (Array[String], Int))): Boolean = {
  tp._1._2 > tp._2._2
}
val mail_rdd = sc.parallelize(A.toSeq).cache()
val step1 = mail_rdd.cartesian(mail_rdd)
val step2 = step1.filter(filter1)

I get the error "Class not found". What I'm doing

Re: Can't filter

2015-12-10 Thread Ndjido Ardo Bar
Please send your call stack with the full description of the exception. > On 10 Dec 2015, at 12:10, Бобров Виктор wrote: > > Hi, I can't filter my rdd. > > def filter1(tp: ((Array[String], Int), (Array[String], Int))): Boolean= { > tp._1._2 > tp._2._2 > } > val mail_rdd =

Inverse of the matrix

2015-12-10 Thread Arunkumar Pillai
Hi, I need to find the inverse of the (X-transpose * X) matrix. I have X transpose and the matrix multiplication done. Is there any way to find the inverse of the matrix? -- Thanks and Regards Arun
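One common approach (a sketch under assumptions, not from the thread): if X has only a modest number of columns, X^T * X is a small local matrix, so you can compute the Gramian distributedly with MLlib and invert it on the driver with Breeze, which ships with Spark:

    import breeze.linalg.{inv, DenseMatrix => BDM}
    import org.apache.spark.mllib.linalg.distributed.RowMatrix

    // rowMatrix is assumed to be a RowMatrix wrapping an RDD[Vector] of X's rows.
    val gram = rowMatrix.computeGramianMatrix() // X^T * X as a local k x k Matrix
    // Convert the local MLlib matrix to Breeze and invert it on the driver.
    val gramInv = inv(new BDM(gram.numRows, gram.numCols, gram.toArray))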

RE: Can't filter

2015-12-10 Thread Бобров Виктор
Spark 1.5.1, thanks for the help.

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
import scala.io.Source

object SimpleApp {
  def main(args: Array[String]) {
    var A = scala.collection.mutable.Map[Array[String], Int]()
    val

Re: Help: Get Timeout error and FileNotFoundException when shuffling large files

2015-12-10 Thread Sudhanshu Janghel
Can you please paste the stack trace? Sudhanshu

RE: Can't filter

2015-12-10 Thread Бобров Виктор
0 = {StackTraceElement@7132} "com.esotericsoftware.reflectasm.shaded.org.objectweb.asm.ClassReader.a(Unknown Source)"
1 = {StackTraceElement@7133} "com.esotericsoftware.reflectasm.shaded.org.objectweb.asm.ClassReader.<init>(Unknown Source)"
2 = {StackTraceElement@7134}

example of querying LDA model

2015-12-10 Thread Olga Syrova
Dear Spark users, I created an LDA model using Spark in Java and would like to do some similarity queries now; I'm especially interested in a "query -> most similar docs" method. I spent many hours looking for examples of how to map a query into LDA space, but didn't come up with any clear

Re: Sharing object/state accross transformations

2015-12-10 Thread JayKay
I solved the problem by passing the HLL object to the function, updating it and returning it as the new state. This was obviously a thinking barrier... ;-)

Apache spark Web UI on Amazon EMR not working

2015-12-10 Thread sonal sharma
Hi, We are using Spark on Amazon EMR 4.1. To access the Spark web UI, we are using the link in the YARN resource manager, but we are seeing a blank page. Further, using Firefox debugging we noticed that we got an HTTP 500 error in response. We have tried configuring proxy settings for AWS and also

Re: Is Spark History Server supported for Mesos?

2015-12-10 Thread Steve Loughran
On 9 Dec 2015, at 22:01, Kelvin Chu <2dot7kel...@gmail.com> wrote: Spark on YARN can use the History Server by setting the configuration spark.yarn.historyServer.address. That's the stuff in SPARK-1537 which isn't actually built in yet. But, I can't find similar

RE: Can't filter

2015-12-10 Thread Бобров Виктор
Build.sbt:

name := "spark"
version := "1.0"
scalaVersion := "2.11.7"
libraryDependencies ++= Seq(
  "org.apache.spark" % "spark-core_2.11" % "1.5.1",
  "org.apache.spark" % "spark-streaming_2.11" % "1.5.1",
  "org.apache.spark" % "spark-mllib_2.11" %

Workflow manager for Spark and Spark SQL

2015-12-10 Thread Alexander Pivovarov
Hi everyone, I'm curious what people usually use to build ETL workflows based on DataFrames and the Spark API. In the Hadoop/Hive world people usually use Oozie. Is it different in the Spark world?

Structured Vector Format

2015-12-10 Thread Hayri Volkan Agun
Hi everyone, I have a machine learning problem with many numeric features. I need to specify the attribute name of a specific vector index for the readability of my feature selection process. Is it possible to name the data frame vector elements with a structured expression? I am using

Re: Warning: Master endpoint spark://ip:7077 was not a REST server. Falling back to legacy submission gateway instead.

2015-12-10 Thread Jakob Odersky
Is there any other process using port 7077? On 10 December 2015 at 08:52, Andy Davidson wrote: > Hi > > I am using spark-1.5.1-bin-hadoop2.6. Any idea why I get this warning. My > job seems to run with out any problem. > > Kind regards > > Andy > > +

Re: Warning: Master endpoint spark://ip:7077 was not a REST server. Falling back to legacy submission gateway instead.

2015-12-10 Thread Andy Davidson
Hi Jakob, The cluster was set up using the spark-1.5.1-bin-hadoop2.6/ec2/spark-ec2 script. Given my limited knowledge, I think this looks okay? Thanks, Andy

$ sudo netstat -peant | grep 7077
tcp 0 0 :::172-31-30-51:7077 :::* LISTEN 0 31164142 7355/java
tcp

Re: How to make this Spark 1.5.2 code fast and shuffle less data

2015-12-10 Thread manasdebashiskar
Have you tried persisting sourceFrame with MEMORY_AND_DISK? Maybe you can cache updatedRDD, which gets used in the next two lines. Are you sure you are paying the performance penalty because of shuffling only? Yes, group by is a killer. How much time does your code spend in GC? Can't tell if your
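For reference, a minimal sketch of that suggestion (sourceFrame and updatedRDD are the names from the thread):

    import org.apache.spark.storage.StorageLevel

    // Spill partitions that don't fit in memory to disk instead of
    // recomputing them from the source each time they are reused.
    sourceFrame.persist(StorageLevel.MEMORY_AND_DISK)
    updatedRDD.persist(StorageLevel.MEMORY_AND_DISK)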

Re: Spark job submission REST API

2015-12-10 Thread Andrew Or
Hello, The hidden API was implemented for use internally and there are no plans to make it public at this point. It was originally introduced to provide backward compatibility in submission protocol across multiple versions of Spark. A full-fledged stable REST API for submitting applications

Re: Spark job submission REST API

2015-12-10 Thread Harsh J
You could take a look at Livy also: https://github.com/cloudera/livy#welcome-to-livy-the-rest-spark-server On Fri, Dec 11, 2015 at 8:17 AM Andrew Or wrote: > Hello, > > The hidden API was implemented for use internally and there are no plans > to make it public at this

Re: Help: Get Timeout error and FileNotFoundException when shuffling large files

2015-12-10 Thread manasdebashiskar
Is that the only kind of error you are getting? Is it possible something else dies and gets buried in the other messages? Try repairing HDFS (fsck etc.) to find out whether the blocks are intact. A few things to check: 1) Do you have too many small files? 2) Is your system complaining about too many inodes, etc.? 3)

Re: Kryo Serialization in Spark

2015-12-10 Thread manasdebashiskar
Are you sure you are using Kryo serialization? You are getting a Java serialization error. Are you setting up your SparkContext with Kryo serialization enabled?
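A minimal sketch of enabling Kryo when building the context (MyRecord is a placeholder for your own class):

    import org.apache.spark.{SparkConf, SparkContext}

    case class MyRecord(id: Int, name: String) // placeholder class

    val conf = new SparkConf()
      .setAppName("kryo-example")
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    // Optional, but avoids writing full class names into serialized data.
    conf.registerKryoClasses(Array(classOf[MyRecord]))
    val sc = new SparkContext(conf)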

Re: State management in spark-streaming

2015-12-10 Thread manasdebashiskar
Have you taken a look at trackStateByKey in Spark Streaming (coming in Spark 1.6)?

Re: "Address already in use" after many streams on Kafka

2015-12-10 Thread manasdebashiskar
You can provide the Spark UI port when creating your context; spark.ui.port can be set to a different port. ..Manas
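For example (4050 is an arbitrary free port, just for illustration):

    import org.apache.spark.{SparkConf, SparkContext}

    // Each concurrently running application needs its own UI port.
    val conf = new SparkConf().setAppName("app").set("spark.ui.port", "4050")
    val sc = new SparkContext(conf)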

Re: How to control number of parquet files generated when using partitionBy

2015-12-10 Thread manasdebashiskar
partitionBy is a suggestion. If your value is bigger than what Spark calculates (based on what you stated), your value will be used. But repartition is a forced shuffle ("give me exactly the required number of partitions") operation. You might have noticed that repartition caused a bit of
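A sketch of combining the two (df, the partition column, and the output path are placeholders):

    // repartition forces a shuffle into exactly 20 partitions; each task then
    // writes one Parquet part-file per "date" value it holds, which bounds the
    // number of files under each partitionBy directory.
    df.repartition(20)
      .write
      .partitionBy("date")
      .parquet("/path/to/output")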

architecture though experiment: what is the advantage of using kafka with spark streaming?

2015-12-10 Thread Andy Davidson
I noticed that many people are using Kafka with Spark Streaming. Can someone provide a couple of use cases? I imagine some possible purposes of using Kafka might be:
1. providing some buffering?
2. implementing some sort of load balancing for the overall system?
3. providing filtering

Re: Warning: Master endpoint spark://ip:7077 was not a REST server. Falling back to legacy submission gateway instead.

2015-12-10 Thread Andrew Or
Hi Andy, You must be running in cluster mode. The Spark Master accepts client mode submissions on port 7077 and cluster mode submissions on port 6066. This is because standalone cluster mode uses a REST API to submit applications by default. If you submit to port 6066 instead the warning should

Re: Spark job submission REST API

2015-12-10 Thread manasdebashiskar
We use the Ooyala job server. It is great. It has a great set of APIs to cancel jobs, create ad-hoc or persistent contexts, etc. It has great support for remote deploys and tests too, which helps faster coding. The current version is missing a job progress bar, but I could not find the same in the hidden

Re: Comparisons between Ganglia and Graphite for monitoring the Streaming Cluster?

2015-12-10 Thread manasdebashiskar
We use Graphite monitoring. Currently we miss having email notifications for alerts. Not sure whether Ganglia has the same caveat. ..Manas

Re: Spark sql random number or sequence numbers ?

2015-12-10 Thread manasdebashiskar
Use zipWithIndex to achieve the same behavior.
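A tiny sketch of the suggestion (rdd is a placeholder):

    // zipWithIndex assigns a stable, contiguous 0-based Long index per element,
    // instead of relying on random numbers.
    val indexed = rdd.zipWithIndex() // RDD[(T, Long)]
    indexed.take(3).foreach(println)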

Re: Spark Streaming Shuffle to Disk

2015-12-10 Thread manasdebashiskar
How often do you checkpoint?

Re: Does Spark SQL have to scan all the columns of a table in text format?

2015-12-10 Thread manasdebashiskar
Yes, a text file is schemaless. Spark does not know what to skip, so it will read everything. Parquet, as you have stated, is capable of taking advantage of predicate pushdown. ..Manas
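A sketch of the contrast (paths, sqlContext, and the column names are placeholders; Spark 1.5 API):

    import org.apache.spark.sql.functions.col

    // Text: every line is read and parsed before the filter can run.
    val fromText = sc.textFile("/data/events.txt").filter(_.contains("2015-12-10"))

    // Parquet: the filter and the column selection can be pushed down, so
    // non-matching row groups and unneeded columns are skipped at the source.
    val fromParquet = sqlContext.read
      .parquet("/data/events.parquet")
      .filter(col("day") === "2015-12-10")
      .select("user_id")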

Re: StackOverflowError when writing dataframe to table

2015-12-10 Thread Jakob Odersky
Can you give us some more info about the dataframe and caching? Ideally a set of steps to reproduce the issue On 9 December 2015 at 14:59, apu mishra . rr wrote: > The command > > mydataframe.write.saveAsTable(name="tablename") > > sometimes results in

Re: Spark streaming with Kinesis broken?

2015-12-10 Thread Nick Pentreath
Yup, also works for me on the master branch as I've been testing DynamoDB Streams integration. In fact it works with the latest KCL 1.6.1 also, which I was using. So the KCL version does seem like it could be the issue - somewhere along the line an exception must be getting swallowed. Though the tests

Re: DataFrame: Compare each row to every other row?

2015-12-10 Thread manasdebashiskar
You can use the evil "group by key" and a conventional method to compare each row against the others within that iterable. If your similarity function produces n-1 iterable results for n inputs, then you can use a flatMap to do all that work on the worker side. Spark also has a cartesian product that might help in
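A hedged sketch of the cartesian route (rdd and the similarity function are placeholders; note the O(n^2) cost):

    // similarity() is a stand-in for your own comparison function.
    def similarity(a: String, b: String): Double = if (a == b) 1.0 else 0.0

    val pairs = rdd.cartesian(rdd)
      .filter { case (a, b) => a != b }             // drop self-comparisons
      .map { case (a, b) => (a, b, similarity(a, b)) }
      .filter { case (_, _, score) => score > 0.8 } // keep similar pairs only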

Re: How to change StreamingContext batch duration after loading from checkpoint

2015-12-10 Thread manasdebashiskar
Not sure what your requirement is there, but if you have a 2-second streaming batch, you can create a 4-second stream out of it (as sketched below); the other way around is not possible. Basically you can create one stream out of another stream. ..Manas
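A sketch of deriving the slower stream (names assumed; the window and slide must be multiples of the batch interval):

    import org.apache.spark.streaming.Seconds

    // twoSecondStream comes from a StreamingContext with a 2-second batch.
    // window() groups two consecutive batches into one 4-second stream.
    val fourSecondStream = twoSecondStream.window(Seconds(4), Seconds(4))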

Re: Spark streaming with Kinesis broken?

2015-12-10 Thread Burak Yavuz
I don't think the Kinesis tests specifically ran when that was merged into 1.5.2 :( https://github.com/apache/spark/pull/8957 https://github.com/apache/spark/commit/883bd8fccf83aae7a2a847c9a6ca129fac86e6a3 AFAIK pom changes don't trigger the Kinesis tests. Burak On Thu, Dec 10, 2015 at 8:09 PM,

Error Handling approach for SparkSQL queries in Spark version 1.4

2015-12-10 Thread satish chandra j
Hi All, Any inputs on an error-handling approach for Spark SQL or DataFrames? Thanks for all your valuable inputs in advance. Regards, Satish Chandra

Re: Can't filter

2015-12-10 Thread Sudhanshu Janghel
Be sure to mention the class name using the *--class* parameter to spark-submit... I see no other reason for a "class not found" exception. Sudhanshu On Thu, Dec 10, 2015 at 11:50 AM, Harsh J wrote: > Are you sure you do not have any messages preceding the trace, such as

Warning: Master endpoint spark://ip:7077 was not a REST server. Falling back to legacy submission gateway instead.

2015-12-10 Thread Andy Davidson
Hi I am using spark-1.5.1-bin-hadoop2.6. Any idea why I get this warning. My job seems to run with out any problem. Kind regards Andy + /root/spark/bin/spark-submit --class com.pws.spark.streaming.IngestDriver --master spark://ec2-54-205-209-122.us-west-1.compute.amazonaws.com:7077

Re: memory leak when saving Parquet files in Spark

2015-12-10 Thread Cheng Lian
This is probably caused by schema merging. Were you using Spark 1.4 or earlier versions? Could you please try the following snippet to see whether it helps:

df.write
  .format("parquet")
  .option("mergeSchema", "false")
  .partitionBy(partitionCols: _*)
  .mode(saveMode)
  .save(targetPath)

Re: RE: Spark assembly in Maven repo?

2015-12-10 Thread fightf...@163.com
Using Maven to download the assembly jar is fine. I would recommend deploying this assembly jar to your local Maven repo, i.e. a Nexus repo, or more likely a snapshot repository. fightf...@163.com

Re: Spark assembly in Maven repo?

2015-12-10 Thread Jeff Zhang
I don't think making the assembly jar a dependency is good practice. You may hit jar-hell issues in that case. On Fri, Dec 11, 2015 at 2:46 PM, Xiaoyong Zhu wrote: > Hi Experts, > > > > We have a project which has a dependency on the following jar > > > >

RE: Spark assembly in Maven repo?

2015-12-10 Thread Xiaoyong Zhu
Sorry – I didn’t make it clear. It’s not actually a “dependency” – we are building a plugin for IntelliJ with which we want to distribute this jar. But since the jar is updated frequently, we don't want to ship it together with our plugin; instead we would like to download

Re: RE: Spark assembly in Maven repo?

2015-12-10 Thread Mark Hamstra
No, publishing a spark assembly jar is not fine. See the doc attached to https://issues.apache.org/jira/browse/SPARK-11157 and be aware that a likely goal of Spark 2.0 will be the elimination of assemblies. On Thu, Dec 10, 2015 at 11:19 PM, fightf...@163.com wrote: > Using

Re: How to use collections inside foreach block

2015-12-10 Thread Madabhattula Rajesh Kumar
Hi Rishi and Ted, Thank you for the response. Now I'm using accumulators and getting results. I have another query: how do I parallelize the code? Example: var listOfIds is a ListBuffer with 2 records. I'm creating batches; each batch size is 500, which means the total number of batches is 40.

Spark streaming with Kinesis broken?

2015-12-10 Thread Brian London
Has anyone managed to run the Kinesis demo in Spark 1.5.2? The Kinesis ASL that ships with 1.5.2 appears to not work for me although 1.5.1 is fine. I spent some time with Amazon earlier in the week and the only thing we could do to make it work is to change the version to 1.5.1. Can someone

Re: architecture though experiment: what is the advantage of using kafka with spark streaming?

2015-12-10 Thread Cody Koeninger
Kafka provides buffering, ordering, and decoupling of producers from multiple consumers. So pretty much any time you have requirements for asynchronous processing, fault tolerance, and/or a common view of the order of events across multiple consumers, Kafka is worth a look. Spark provides a much richer

org.apache.spark.SparkException: Task failed while writing rows.+ Spark output data to hive table

2015-12-10 Thread Divya Gehlot
Hi, I am using HDP 2.3.2 with Spark 1.4.1 and trying to insert data into a Hive table using the Hive context. Below is the sample code:

spark-shell --master yarn-client --driver-memory 512m --executor-memory 512m

//Sample code
import org.apache.spark.sql.SQLContext
import

Re: Spark streaming with Kinesis broken?

2015-12-10 Thread Nick Pentreath
Yeah, also the integration tests need to be specifically run - I would have thought the contributor would have run those tests and also tested the change themselves using live Kinesis :( On Fri, Dec 11, 2015 at 6:18 AM, Burak Yavuz wrote: > I don't

Re: Spark Java.lang.NullPointerException

2015-12-10 Thread michael_han
Hi Sarala, I found the reason: it's because Spark still needs Hadoop support when it runs. I think it's a bug in Spark and still not fixed ;) After I downloaded winutils.exe and followed the steps in the workaround below, it works fine:

Spark assembly in Maven repo?

2015-12-10 Thread Xiaoyong Zhu
Hi Experts, We have a project which has a dependency on the following jar: spark-assembly-<version>-hadoop<version>.jar, for example: spark-assembly-1.4.1.2.3.3.0-2983-hadoop2.7.1.2.3.3.0-2983.jar. Since this assembly might be updated in the future, I am not sure if there is a Maven repo that has the above spark

Re: GLM in apache spark in MLlib

2015-12-10 Thread Yanbo Liang
Hi Arunkumar, LinearRegression, LogisticRegression and AFTSurvivalRegression are kinds of GLMs, and they are already part of MLlib. Actually, GLM in SparkR calls MLlib as the backend execution engine, but only the "gaussian" and "binomial" families are supported currently. MLlib will continue to improve

Re: HELP! I get "java.lang.String cannot be cast to java.lang.Integer" for a long time.

2015-12-10 Thread Bonsen
I do it like this: "val secondData = rawData.flatMap(_.split("\t").take(3))" and I find:

15/12/10 22:36:55 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, 219.216.65.129): java.lang.ClassCastException: java.lang.String cannot be cast to java.lang.Integer at

Spark Streaming Kinesis - DynamoDB Streams compatability

2015-12-10 Thread Nick Pentreath
Hi Spark users & devs I was just wondering if anyone out there has interest in DynamoDB Streams ( http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/Streams.html) as an input source for Spark Streaming Kinesis? Because DynamoDB Streams provides an adaptor client that works with the

Spark on EMR: out-of-the-box solution for real-time application logs monitoring?

2015-12-10 Thread Roberto Coluccio
Hello, I'm investigating a solution to monitor, in real time, the Spark logs produced by my EMR cluster in order to collect statistics and trigger alarms. Being on EMR, I found the CloudWatch Logs + Lambda approach pretty straightforward and, since I'm on AWS, those services are pretty well integrated

Spark job submission REST API

2015-12-10 Thread mvle
Hi, I would like to use Spark as a service through REST API calls for uploading and submitting a job, getting results, etc. There is a project by the folks at Ooyala: https://github.com/spark-jobserver/spark-jobserver I also encountered some hidden job REST APIs in Spark: