Re: Spark Streaming on Yarn Input from Flume

2014-08-07 Thread Hari Shreedharan
Do you see anything suspicious in the logs? How did you run the application? On Thu, Aug 7, 2014 at 10:02 PM, XiaoQinyu wrote: > Hi~ > > I run a spark streaming app to receive data from flume event. When I run on > standalone, Spark Streaming can receive the Flume event normally. But if I > run t

Shared variable in Spark Streaming

2014-08-07 Thread Soumitra Kumar
Hello, I want to count the number of elements in the DStream, like RDD.count(). Since there is no such method in DStream, I thought of using DStream.count and using an accumulator. How do I do DStream.count() to count the number of elements in a DStream? How do I create a shared variable in Spar
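A minimal sketch of the accumulator idea, assuming a StreamingContext `ssc` and an input DStream `lines` (both names are hypothetical). DStream.count() gives a per-batch count as a DStream[Long]; an accumulator keeps a running total across batches:

    // running total of elements seen across all batches (sketch, not the poster's actual code)
    val totalCount = ssc.sparkContext.accumulator(0L)
    lines.foreachRDD { rdd =>
      totalCount += rdd.count()   // rdd.count() is an action that returns the batch size
    }
    // totalCount.value can be read on the driver after batches have run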

Re: How to use spark-cassandra-connector in spark-shell?

2014-08-07 Thread Andrew Ash
I don't remember the details, but I think it just took adding the spark-cassandra-connector jar to the spark shell's classpath with --jars or maybe ADD_JARS and then it worked. On Thu, Aug 7, 2014 at 10:24 PM, Gary Zhao wrote: > Thanks Andrew. How did you do it? > > > On Thu, Aug 7, 2014 at 10:
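A minimal sketch of that invocation (the connector jar path and name are hypothetical):

    bin/spark-shell --jars /path/to/spark-cassandra-connector-assembly.jar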

Re: How to use spark-cassandra-connector in spark-shell?

2014-08-07 Thread Gary Zhao
Thanks Andrew. How did you do it? On Thu, Aug 7, 2014 at 10:20 PM, Andrew Ash wrote: > Yes, I've done it before. > > > On Thu, Aug 7, 2014 at 10:18 PM, Gary Zhao wrote: > >> Hello >> >> Is it possible to use spark-cassandra-connector in spark-shell? >> >> Thanks >> Gary >> > >

Re: How to use spark-cassandra-connector in spark-shell?

2014-08-07 Thread Andrew Ash
Yes, I've done it before. On Thu, Aug 7, 2014 at 10:18 PM, Gary Zhao wrote: > Hello > > Is it possible to use spark-cassandra-connector in spark-shell? > > Thanks > Gary >

How to use spark-cassandra-connector in spark-shell?

2014-08-07 Thread Gary Zhao
Hello Is it possible to use spark-cassandra-connector in spark-shell? Thanks Gary

Re: Low Performance of Shark over Spark.

2014-08-07 Thread vinay . kashyap
Hi Meng, I cannot use a cached table in this case as the data size is quite huge. Also, as I am trying to run ad-hoc queries, I cannot keep the table cached. I can cache the table only when the types of queries are fixed and apply to a specific set of data. Thanks and regards Vin

Spark Streaming on Yarn Input from Flume

2014-08-07 Thread XiaoQinyu
Hi~ I run a Spark Streaming app to receive data from Flume events. When I run it in standalone mode, Spark Streaming can receive the Flume events normally. But if I run this app on YARN, whether in yarn-client or yarn-cluster mode, the app cannot receive data from Flume, and when I check the netstat I find the port

Re: SparkR : lapplyPartition transforms the data in vertical format

2014-08-07 Thread Shivaram Venkataraman
I tried this out and what is happening here is that, as the input file is small, only 1 partition is created. lapplyPartition runs the given function on the partition and computes sumx as 55 and sumy as 55. Now the return value from lapplyPartition is treated as a list by SparkR and collect concatena

Java API for GraphX

2014-08-07 Thread vdiwakar.malladi
Hi, Could you please let me know whether a Java API is available for the GraphX component or not. If so, could you please point me to it. Thanks in advance. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Java-API-for-GraphX-tp11752.html Sent from the

Re: Spark: Could not load native gpl library

2014-08-07 Thread Andrew Ash
Hi Jikai, It looks like you're trying to run a Spark job on data that's stored in HDFS in .lzo format. Spark can handle this (I do it all the time), but you need to configure your Spark installation to know about the .lzo format. There are two parts to the hadoop lzo library -- the first is the
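A rough sketch of one way to wire that up through SparkConf, assuming the hadoop-lzo jar and its native libgplcompression library sit at the hypothetical paths below:

    import org.apache.spark.{SparkConf, SparkContext}

    // sketch only: both paths are placeholders for wherever hadoop-lzo is installed
    val conf = new SparkConf()
      .setAppName("lzo-example")
      .set("spark.executor.extraClassPath", "/opt/hadoop-lzo/hadoop-lzo.jar")   // the LZO codec classes
      .set("spark.executor.extraLibraryPath", "/opt/hadoop-lzo/lib/native")     // the native libgplcompression
    val sc = new SparkContext(conf)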

Internal Error: Missing Template ERR_DNS_FAIL

2014-08-07 Thread hansen
There is an error message: "Internal Error: Missing Template ERR_DNS_FAIL" on my Spark cluster's web ui. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Internal-Error-Missing-Template-ERR-DNS-FAIL-tp11750.html Sent from the Apache Spark User List mailing l

Re: How can I implement eigenvalue decomposition in Spark?

2014-08-07 Thread x
Are U.rows.toArray.take(1).foreach(println) and V.toArray.take(s.size).foreach(println) not the eigenvectors corresponding to the biggest eigenvalue s.toArray(0)*s.toArray(0)? xj @ Tokyo On Fri, Aug 8, 2014 at 12:07 PM, Chunnan Yao wrote: > Hi there, what you've suggested are all meaningful. But t

Re: How can I implement eigenvalue decomposition in Spark?

2014-08-07 Thread Chunnan Yao
Hi there, what you've suggested are all meaningful. But to make myself clearer, my essential problems are: 1. My matrix is asymmetric, and it is a probabilistic adjacency matrix, whose entries (a_ij) represent the likelihood that user j will broadcast the information generated by user i. Apparently

Re: Got error "java.lang.IllegalAccessError" when using HiveContext in Spark shell on AWS

2014-08-07 Thread Zhun Shen
Hi Cheng, I replaced Guava 15.0 with Guava 14.0.1 in my spark classpath, the problem was solved. So your method is correct. It proved that this issue was caused by AWS EMR (ami-version 3.1.0) libs which include Guava 15.0. Many thanks and see you in the first Spark User Beijing Meetup tomorrow.

Re: Spark: Could not load native gpl library

2014-08-07 Thread Xiangrui Meng
Is the GPL library only available on the driver node? If that is the case, you need to add them to `--jars` option of spark-submit. -Xiangrui On Thu, Aug 7, 2014 at 6:59 PM, Jikai Lei wrote: > I had the following error when trying to run a very simple spark job (which > uses logistic regression w

Re: KMeans Input Format

2014-08-07 Thread Xiangrui Meng
Besides durin's suggestion, please also confirm driver and executor memory in the WebUI, since they are small according to the log: 14/08/07 19:59:10 INFO MemoryStore: Block broadcast_0 stored as values to memory (estimated size 34.6 KB, free 303.3 MB) -Xiangrui -

Re: questions about MLLib recommendation models

2014-08-07 Thread Xiangrui Meng
ratings.map{ case Rating(u,m,r) => { val pred = model.predict(u, m) (r - pred)*(r - pred) } }.mean() The code doesn't work because the userFeatures and productFeatures stored in the model are RDDs. You tried to serialize them into the task closure, and execute `model.predict` on an execu
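A sketch of the RDD-based path Xiangrui is pointing at, assuming `model` is a MatrixFactorizationModel and `ratings` is the RDD[Rating] from the thread:

    import org.apache.spark.mllib.recommendation.Rating

    val usersProducts = ratings.map { case Rating(u, m, _) => (u, m) }
    val predictions = model.predict(usersProducts)                 // predict on an RDD of (user, product)
      .map { case Rating(u, m, pred) => ((u, m), pred) }
    val mse = ratings.map { case Rating(u, m, r) => ((u, m), r) }
      .join(predictions)
      .map { case (_, (r, pred)) => (r - pred) * (r - pred) }
      .mean()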

Spark: Could not load native gpl library

2014-08-07 Thread Jikai Lei
I had the following error when trying to run a very simple spark job (which uses logistic regression with SGD in mllib): ERROR GPLNativeCodeLoader: Could not load native gpl library java.lang.UnsatisfiedLinkError: no gplcompression in java.library.path at java.lang.ClassLoader.loadLibrary(Clas

Re: Missing SparkSQLCLIDriver and Beeline drivers in Spark

2014-08-07 Thread Cheng Lian
Things have changed a bit in the master branch, and the SQL programming guide in master branch actually doesn’t apply to branch-1.0-jdbc. In branch-1.0-jdbc, Hive Thrift server and Spark SQL CLI are included in the hive profile and are thus not enabled by default. You need to either - pass -Ph

Re: [MLLib]:choosing the Loss function

2014-08-07 Thread Burak Yavuz
The following code will allow you to run Logistic Regression using L-BFGS: val lbfgs = new LBFGS(new LogisticGradient(), new SquaredL2Updater()) lbfgs.setMaxNumIterations(numIterations).setRegParam(regParam).setConvergenceTol(tol).setNumCorrections(numCor) val weights = lbfgs.optimize(data, initi
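A sketch expanding the snippet above; the parameter values are placeholders and `data: RDD[(Double, Vector)]` is the (label, features) training set assumed by the thread:

    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.optimization.{LBFGS, LogisticGradient, SquaredL2Updater}

    val lbfgs = new LBFGS(new LogisticGradient(), new SquaredL2Updater())
    lbfgs.setMaxNumIterations(100)
      .setRegParam(0.1)
      .setConvergenceTol(1e-4)
      .setNumCorrections(10)
    val numFeatures = data.first()._2.size
    val initialWeights = Vectors.dense(new Array[Double](numFeatures))   // start from all zeros
    val weights = lbfgs.optimize(data, initialWeights)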

Re: [MLLib]:choosing the Loss function

2014-08-07 Thread Evan R. Sparks
The loss functions are represented in the various names of the model families. SVM is hinge loss, LogisticRegression is logistic loss, LinearRegression is linear loss. These are used internally as arguments to the SGD and L-BFGS optimizers. On Thu, Aug 7, 2014 at 6:31 PM, SK wrote: > Hi, > > Ac

Re: How can I implement eigenvalue decomposition in Spark?

2014-08-07 Thread x
The SVD computed result already contains descending order of singular values, you can get the biggest eigenvalue. --- val svd = matrix.computeSVD(matrix.numCols().toInt, computeU = true) val U: RowMatrix = svd.U val s: Vector = svd.s val V: Matrix = svd.V U.rows.toArray.take(1).foreac
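A sketch completing the snippet, assuming `matrix` is a RowMatrix:

    import org.apache.spark.mllib.linalg.{Matrix, Vector}
    import org.apache.spark.mllib.linalg.distributed.RowMatrix

    val svd = matrix.computeSVD(matrix.numCols().toInt, computeU = true)
    val U: RowMatrix = svd.U   // left singular vectors (distributed rows)
    val s: Vector = svd.s      // singular values, already in descending order
    val V: Matrix = svd.V      // right singular vectors (local matrix)
    println(s(0))              // largest singular value
    U.rows.take(1).foreach(println)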

[MLLib]:choosing the Loss function

2014-08-07 Thread SK
Hi, According to the MLLib guide, there seems to be support for different loss functions. But I could not find a command line parameter to choose the loss function but only found regType to choose the regularization. Does MLLib support a parameter to choose the loss function? thanks -- View t

Re: Regularization parameters

2014-08-07 Thread SK
What is the definition of regParam and what is the range of values it is allowed to take? thanks -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Regularization-parameters-tp11601p11737.html Sent from the Apache Spark User List mailing list archive at Nabbl

Re: memory issue on standalone master

2014-08-07 Thread Baoqiang Cao
My problem was that I didn't know how to add it. For what it's worth, it was solved by editing spark-env.sh. Thanks anyway! Baoqiang Cao Blog: http://baoqiang.org Email: bqcaom...@gmail.com On Aug 7, 2014, at 3:27 PM, maddenpj wrote: > It looks like your Java heap space is too low: -

Re: Use SparkStreaming to find the max of a dataset?

2014-08-07 Thread Tathagata Das
You can do the following. var globalMax = ... dstreamOfNumericalType.foreachRDD( rdd => { globalMax = math.max(rdd.max, globalMax) }) globalMax will keep getting updated after every batch TD On Thu, Aug 7, 2014 at 5:31 PM, bumble123 wrote: > I can't figure out how to use Spark Streami
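A sketch of the same idea with the name from TD's snippet, guarding against empty batches (the guard is an added assumption, since max() on an empty RDD would throw):

    var globalMax = Double.MinValue
    dstreamOfNumericalType.foreachRDD { rdd =>
      if (rdd.count() > 0) {
        globalMax = math.max(rdd.max(), globalMax)
      }
      println("max so far: " + globalMax)   // updated after every batch
    }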

Use SparkStreaming to find the max of a dataset?

2014-08-07 Thread bumble123
I can't figure out how to use Spark Streaming to find the max of a 5 second batch of data and keep updating the max every 5 seconds. How would I do this? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Use-SparkStreaming-to-find-the-max-of-a-dataset-tp11734.

Re: Spark Streaming + reduceByWindow(reduceFunc, invReduceFunc, windowDuration, slideDuration

2014-08-07 Thread Tathagata Das
That is required for driver fault-tolerance, as well as for some transformations like updateStateByKey that persist information across batches. It must be an HDFS directory when running on a cluster. TD On Thu, Aug 7, 2014 at 4:25 PM, salemi wrote: > That is correct. I do ssc.checkpoint("check
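For example (the path is hypothetical), on a cluster the checkpoint directory would typically look like:

    ssc.checkpoint("hdfs://namenode:8020/user/me/streaming-checkpoints")   // a path reachable by all nodes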

Re: Spark Streaming- Input from Kafka, output to HBase

2014-08-07 Thread JiajiaJing
Hi Khanderao and TD Thank you very much for your reply and the new example. I have resolved the problem. The zookeeper port I used wasn't right, the default port is not the one that I suppose to use. So I set the "hbase.zookeeper.property.clientPort" to the correct port and everything worked. Be

Re: Spark Streaming + reduceByWindow(reduceFunc, invReduceFunc, windowDuration, slideDuration

2014-08-07 Thread salemi
That is correct. I do ssc.checkpoint("checkpoint"). Why is the checkpoint required? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-reduceByWindow-reduceFunc-invReduceFunc-windowDuration-slideDuration-tp11591p11731.html Sent from the Apache

Re: Spark Streaming + reduceByWindow(reduceFunc, invReduceFunc, windowDuration, slideDuration

2014-08-07 Thread Tathagata Das
Are you running on a cluster but giving a local path in ssc.checkpoint(...) ? TD On Thu, Aug 7, 2014 at 3:24 PM, salemi wrote: > Hi, > > Thank you or your help. With the new code I am getting the following error > in the driver. What is going wrong here? > > 14/08/07 13:22:28 ERROR JobSchedule

Re: PySpark, numpy arrays and binary data

2014-08-07 Thread Davies Liu
On Thu, Aug 7, 2014 at 12:06 AM, Rok Roskar wrote: > sure, but if you knew that a numpy array went in on one end, you could safely > use it on the other end, no? Perhaps it would require an extension of the RDD > class and overriding the collect() method. > >> Could you give a short example about

Re: Spark Streaming + reduceByWindow(reduceFunc, invReduceFunc, windowDuration, slideDuration

2014-08-07 Thread salemi
Hi, Thank you for your help. With the new code I am getting the following error in the driver. What is going wrong here? 14/08/07 13:22:28 ERROR JobScheduler: Error running job streaming job 1407450148000 ms.0 org.apache.spark.SparkException: Checkpoint RDD CheckpointRDD[4528] at apply at List.sca

Re: All of the tasks have been completed but the Stage is still shown as "Active"?

2014-08-07 Thread anthonyjschu...@gmail.com
Similarly, I am seeing tasks moved to the "completed" section which apparently haven't finished all elements... (succeeded/total < 1)... is this related? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/All-of-the-tasks-have-been-completed-but-the-Stage-is-st

Re: SparkR : lapplyPartition transforms the data in vertical format

2014-08-07 Thread Pranay Dave
Hello Zongheng, In fact the problem is in lapplyPartition. lapply gives output as 1,1 2,2 3,3 ... 10,10 However lapplyPartition gives output as 55, NA 55, NA Why is the lapply output horizontal and the lapplyPartition output vertical? Here is my code library(SparkR) sc <- sparkR.init("local") lines <- text

Missing SparkSQLCLIDriver and Beeline drivers in Spark

2014-08-07 Thread ajatix
Hi I wish to migrate from shark to the spark-sql shell, where I am facing some difficulties in setting up. I cloned the "branch-1.0-jdbc" to test out the spark-sql shell, but I am unable to run it after building the source. I've tried two methods for building (with Hadoop 1.0.4) - sbt/sbt assem

Re: PySpark + executor lost

2014-08-07 Thread Davies Liu
What is the environment? YARN or Mesos or Standalone? It would be more helpful if you could show more of the logs. On Wed, Aug 6, 2014 at 7:25 PM, Avishek Saha wrote: > Hi, > > I get a lot of executor lost error for "saveAsTextFile" with PySpark > and Hadoop 2.4. > > For small datasets this error o

Lost executors

2014-08-07 Thread rpandya
I'm running into a problem with executors failing, and it's not clear what's causing it. Any suggestions on how to diagnose & fix it would be appreciated. There are a variety of errors in the logs, and I don't see a consistent triggering error. I've tried varying the number of executors per machin

Re: KMeans Input Format

2014-08-07 Thread durin
Not all memory can be used for Java heap space, so maybe it does run out. Could you try repartitioning the data? To my knowledge you shouldn't be thrown out as long as a single partition fits into memory, even if the whole dataset does not. To do that, exchange val train = parsedData.cache() wit
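A sketch of that change, assuming `parsedData: RDD[Vector]` as in the MLlib KMeans example (the partition count is arbitrary):

    import org.apache.spark.mllib.clustering.KMeans

    val train = parsedData.repartition(64).cache()   // spread the data over more, smaller partitions
    val model = KMeans.train(train, 2, 2)            // k = 2, maxIterations = 2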

Re: Spark 1.0.1 NotSerialized exception (a bit of a head scratcher)

2014-08-07 Thread Tathagata Das
LOL! Glad it solved it. TD On Thu, Aug 7, 2014 at 2:23 PM, Padmanabhan, Mahesh (contractor) < mahesh.padmanab...@twc-contractor.com> wrote: > Slap my head moment – using rdd.context solved it! > > Thanks TD, > > Mahesh > > From: Tathagata Das > Date: Thursday, August 7, 2014 at 3:06 PM > To: M

Re: Spark 1.0.1 NotSerialized exception (a bit of a head scratcher)

2014-08-07 Thread contractor
Slap my head moment – using rdd.context solved it! Thanks TD, Mahesh From: Tathagata Das Date: Thursday, August 7, 2014 at 3:06 PM To: Mahesh Padmanabhan Cc: amit, "u...@spar

Re: Spark 1.0.1 NotSerialized exception (a bit of a head scratcher)

2014-08-07 Thread Tathagata Das
Well I don't see the rdd in the foreachRDD being passed into A.func1(), so I am not sure what the purpose of the function is. Assuming that you do want to pass that RDD into the function, and also want to have access to the sparkContext, you can pass on only the RDD and then access the associated
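A sketch of what that looks like; `A.func1` follows the thread, while `dstream` and the two-argument signature are illustrative:

    dstream.foreachRDD { rdd =>
      val sc = rdd.context     // the SparkContext associated with this RDD
      A.func1(rdd, sc)         // pass both in, instead of capturing the StreamingContext
    }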

Re: trouble with saveAsParquetFile

2014-08-07 Thread Yin Huai
The PR is https://github.com/apache/spark/pull/1840. On Thu, Aug 7, 2014 at 1:48 PM, Yin Huai wrote: > Actually, the issue is if values of a field are always null (or this field > is missing), we cannot figure out the data type. So, we use NullType (it is > an internal data type). Right now, we

Re: trouble with saveAsParquetFile

2014-08-07 Thread Yin Huai
Actually, the issue is if values of a field are always null (or this field is missing), we cannot figure out the data type. So, we use NullType (it is an internal data type). Right now, we have a step to convert the data type from NullType to StringType. This logic in the master has a bug. We will

Re: trouble with saveAsParquetFile

2014-08-07 Thread Brad Miller
Thanks Yin! best, -Brad On Thu, Aug 7, 2014 at 1:39 PM, Yin Huai wrote: > Hi Brad, > > It is a bug. I have filed https://issues.apache.org/jira/browse/SPARK-2908 > to track it. It will be fixed soon. > > Thanks, > > Yin > > > On Thu, Aug 7, 2014 at 10:55 AM, Brad Miller > wrote: > >> Hi All,

RE: Save an RDD to a SQL Database

2014-08-07 Thread Jim Donahue
Depending on what you mean by "save," you might be able to use the Twitter Storehaus package to do this. There was a nice talk about this at a Spark meetup -- "Stores, Monoids and Dependency Injection - Abstractions for Spark Streaming Jobs." Video here: https://www.youtube.com/watch?v=C7gWtx

Re: trouble with saveAsParquetFile

2014-08-07 Thread Yin Huai
Hi Brad, It is a bug. I have filed https://issues.apache.org/jira/browse/SPARK-2908 to track it. It will be fixed soon. Thanks, Yin On Thu, Aug 7, 2014 at 10:55 AM, Brad Miller wrote: > Hi All, > > I'm having a bit of trouble with nested data structures in pyspark with > saveAsParquetFile.

Re: memory issue on standalone master

2014-08-07 Thread maddenpj
It looks like your Java heap space is too low: -Xmx512m. It's only using .5G of RAM, try bumping this up -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/memory-issue-on-standalone-master-tp11610p11711.html Sent from the Apache Spark User List mailing list ar

Re: questions about MLLib recommendation models

2014-08-07 Thread Burak Yavuz
Hi Jay, I've had the same problem you've been having in Question 1 with a synthetic dataset. I thought I wasn't producing the dataset well enough. This seems to be a bug. I will open a JIRA for it. Instead of using: ratings.map{ case Rating(u,m,r) => { val pred = model.predict(u, m) (r

Re: questions about MLLib recommendation models

2014-08-07 Thread Sean Owen
On Thu, Aug 7, 2014 at 9:06 PM, Jay Hutfles wrote: > 0,0,5 > 0,1,5 > 0,2,0 > 0,3,0 > 1,0,5 > 1,3,0 > 2,1,4 > 2,2,0 > 3,0,0 > 3,1,0 > 3,2,5 > 3,3,4 > 4,0,0 > 4,1,0 > 4,2,5 > val rank = 10 This is likely the problem: your rank is actually larger than the number of users or items. The error could p

questions about MLLib recommendation models

2014-08-07 Thread Jay Hutfles
I have a few questions regarding a collaborative filtering model, and was hoping for some recommendations (no pun intended...) *Setup* I have a csv file with user/movie/ratings named unimaginatively 'movies.csv'. Here are the contents: 0,0,5 0,1,5 0,2,0 0,3,0 1,0,5 1,3,0 2,1,4 2,2,0 3,0,0 3,1,0

Re: Spark 1.0.1 NotSerialized exception (a bit of a head scratcher)

2014-08-07 Thread contractor
Thanks TD, Amit. I think I figured out where the problem is through the process of commenting out individual lines of code one at a time :( Can either of you help me find the right solution? I tried creating the SparkContext outside the foreachRDD but that didn’t help. I have an object (let’s

Re: Spark Streaming Workflow Validation

2014-08-07 Thread Dan H.
Yes, thanks, I did in fact mean reduceByKey(), thus allowing the convenience method process the summation by key. Thanks for your feedback! DH -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-Workflow-Validation-tp11677p11706.html Sent from

Re: Spark 1.0.1 NotSerialized exception (a bit of a head scratcher)

2014-08-07 Thread Tathagata Das
From the extended info, I see that you have a function called createStreamingContext() in your code. Somehow that is getting referenced in the foreach function. Is the whole foreachRDD code inside the createStreamingContext() function? Did you try marking the ssc field as transient? Here is a

Re: Save an RDD to a SQL Database

2014-08-07 Thread Vida Ha
That's a good idea - to write to files first and then load. Thanks. On Thu, Aug 7, 2014 at 11:26 AM, Flavio Pompermaier wrote: > Isn't sqoop export meant for that? > > > http://hadooped.blogspot.it/2013/06/apache-sqoop-part-3-data-transfer.html?m=1 > On Aug 7, 2014 7:59 PM, "Nicholas Chammas"

Re: Spark 1.0.1 NotSerialized exception (a bit of a head scratcher)

2014-08-07 Thread amit
There is one more configuration option called spark.closure.serializer that can be used to specify the serializer for closures. Maybe in the class you have the StreamingContext as a field, so when Spark tries to serialize the whole class it uses spark.closure.serializer to serialize even the stre

Re: [Compile error] Spark 1.0.2 against cloudera 2.0.0-cdh4.6.0 error

2014-08-07 Thread Marcelo Vanzin
I think Cloudera only started adding Spark to CDH4 starting with 4.6, so maybe that's the minimum if you want to try out Spark on CDH4. On Thu, Aug 7, 2014 at 11:22 AM, Sean Owen wrote: > Yep, this command given in the Spark docs is correct: > > mvn -Pyarn-alpha -Dhadoop.version=2.0.0-cdh4.2.0 -D

Re: Spark 1.0.1 NotSerialized exception (a bit of a head scratcher)

2014-08-07 Thread contractor
Does this help? I can’t figure out anything new from this extra information. Thanks, Mahesh 2014-08-07 12:27:00,170 [spark-akka.actor.default-dispatcher-4] ERROR akka.actor.OneForOneStrategy - org.apache.spark.streaming.StreamingContext - field (class "com.twc.needle.ep.EventPersister$$

Re: Save an RDD to a SQL Database

2014-08-07 Thread Flavio Pompermaier
Isn't sqoop export meant for that? http://hadooped.blogspot.it/2013/06/apache-sqoop-part-3-data-transfer.html?m=1 On Aug 7, 2014 7:59 PM, "Nicholas Chammas" wrote: > Vida, > > What kind of database are you trying to write to? > > For example, I found that for loading into Redshift, by far the ea

Re: [Compile error] Spark 1.0.2 against cloudera 2.0.0-cdh4.6.0 error

2014-08-07 Thread Sean Owen
Yep, this command given in the Spark docs is correct: mvn -Pyarn-alpha -Dhadoop.version=2.0.0-cdh4.2.0 -DskipTests clean package and while I also would hope that this works, it doesn't compile: mvn -Pyarn -Dhadoop.version=2.0.0-cdh4.6.0 -DskipTests clean package I believe later 4.x includes eff

Re: KMeans Input Format

2014-08-07 Thread AlexanderRiggers
Thanks for your answers. The dataset is only 400MB, so I shouldn't run out of memory. I restructured my code now because I forgot to cache my dataset, and reduced the number of iterations to 2, but I still get kicked out of Spark. Did I cache the data wrong? (Sorry, not an expert): scala> import org.apac

Re: How can I implement eigenvalue decomposition in Spark?

2014-08-07 Thread Shivaram Venkataraman
If you just want to find the top eigenvalue / eigenvector you can do something like the Lanczos method. There is a description of a MapReduce based algorithm in Section 4.2 of [1] [1] http://www.cs.cmu.edu/~ukang/papers/HeigenPAKDD2011.pdf On Thu, Aug 7, 2014 at 10:54 AM, Li Pu wrote: > @Miles

Re: Save an RDD to a SQL Database

2014-08-07 Thread Nicholas Chammas
Vida, What kind of database are you trying to write to? For example, I found that for loading into Redshift, by far the easiest thing to do was to save my output from Spark as a CSV to S3, and then load it from there into Redshift. This is not as slow as you think, because Spark can write the outp

trouble with saveAsParquetFile

2014-08-07 Thread Brad Miller
Hi All, I'm having a bit of trouble with nested data structures in pyspark with saveAsParquetFile. I'm running master (as of yesterday) with this pull request added: https://github.com/apache/spark/pull/1802. *# these all work* > sqlCtx.jsonRDD(sc.parallelize(['{"record": null}'])).saveAsParquet

Re: Spark Streaming Workflow Validation

2014-08-07 Thread Tathagata Das
I am not sure if it is a typo-error or not, but how are you using groupByKey to get the summed_values? Assuming you meant reduceByKey(), these workflows seems pretty efficient. TD On Thu, Aug 7, 2014 at 10:18 AM, Dan H. wrote: > I wanted to post for validation to understand if there is more ef

Re: How can I implement eigenvalue decomposition in Spark?

2014-08-07 Thread Li Pu
@Miles, the latest SVD implementation in mllib is partially distributed. Matrix-vector multiplication is computed among all workers, but the right singular vectors are all stored in the driver. If your symmetric matrix is n x n and you want the first k eigenvalues, you will need to fit n x k double

Re: Save an RDD to a SQL Database

2014-08-07 Thread Vida Ha
The use case I was thinking of was outputting calculations made in Spark into a SQL database for the presentation layer to access. So in other words, having a Spark backend in Java that writes to a SQL database and then having a Rails front-end that can display the data nicely. On Thu, Aug 7, 20
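One pattern commonly suggested for this kind of use case (a generic sketch, not code from the thread): open one JDBC connection per partition and batch the inserts. The table, JDBC URL, credentials, and `results: RDD[(String, Double)]` below are all hypothetical.

    import java.sql.DriverManager

    results.foreachPartition { rows =>
      val conn = DriverManager.getConnection("jdbc:postgresql://host/db", "user", "pass")
      val stmt = conn.prepareStatement("INSERT INTO metrics (name, value) VALUES (?, ?)")
      try {
        rows.foreach { case (name, value) =>
          stmt.setString(1, name)
          stmt.setDouble(2, value)
          stmt.addBatch()
        }
        stmt.executeBatch()     // one round trip per partition instead of per row
      } finally {
        stmt.close()
        conn.close()
      }
    }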

Re: How can I implement eigenvalue decomposition in Spark?

2014-08-07 Thread Evan R. Sparks
Reza Zadeh has contributed the distributed implementation of (Tall/Skinny) SVD (http://spark.apache.org/docs/latest/mllib-dimensionality-reduction.html), which is in MLlib (Spark 1.0) and a distributed sparse SVD coming in Spark 1.1. (https://issues.apache.org/jira/browse/SPARK-1782). If your data

Re: How to read a multipart s3 file?

2014-08-07 Thread Sean Owen
That won't be it, since you can see from the directory listing that there are no data files under test -- only "_" files and dirs. The output looks like it was written, or partially written at least, but didn't finish, in that the part-* files were never moved to the target dir. I don't know why, b

Re: Low Performance of Shark over Spark.

2014-08-07 Thread Xiangrui Meng
Did you cache the table? There are couple ways of caching a table in Shark: https://github.com/amplab/shark/wiki/Shark-User-Guide On Thu, Aug 7, 2014 at 6:51 AM, wrote: > Dear all, > > I am using Spark 0.9.2 in Standalone mode. Hive and HDFS in CDH 5.1.0. > > 6 worker nodes each with memory 96GB

How to Start a JOB programatically from an EC2 machine?

2014-08-07 Thread SankarS
Hello All, If I execute a Hive SQL query from an EC2 machine, I often get a Broken pipe exception. I want to execute the job programmatically. Is there any way available? Please guide me. Thanks and regards, SankarS -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.

Re: Regularization parameters

2014-08-07 Thread Xiangrui Meng
Then this may be a bug. Do you mind sharing the dataset that we can use to reproduce the problem? -Xiangrui On Thu, Aug 7, 2014 at 1:20 AM, SK wrote: > Spark 1.0.1 > > thanks > > > > -- > View this message in context: > http://apache-spark-user-list.1001560.n3.nabble.com/Regularization-parameter

Re: [Compile error] Spark 1.0.2 against cloudera 2.0.0-cdh4.6.0 error

2014-08-07 Thread Marcelo Vanzin
Can you try with "-Pyarn" instead of "-Pyarn-alpha"? I'm pretty sure CDH4 ships with the newer Yarn API. On Thu, Aug 7, 2014 at 8:11 AM, linkpatrickliu wrote: > Hi, > > Following the "" document: > > # Cloudera CDH 4.2.0 > mvn -Pyarn-alpha -Dhadoop.version=2.0.0-cdh4.2.0 -DskipTests clean packag

Re: Spark 1.0.1 NotSerialized exception (a bit of a head scratcher)

2014-08-07 Thread Tathagata Das
Can you enable the java flag -Dsun.io.serialization.extendedDebugInfo=true for the driver in your driver startup script? That should give an indication of the sequence of object references that lead to the StreamingContext being included in the closure. TD On Thu, Aug 7, 2014 at 10:23 AM, Padmanabha

Re: How to read a multipart s3 file?

2014-08-07 Thread sparkuser2345
Ashish Rangole wrote > Specify a folder instead of a file name for input and output code, as in: > > Output: > s3n://your-bucket-name/your-data-folder > > Input: (when consuming the above output) > > s3n://your-bucket-name/your-data-folder/* Unfortunately no luck: Exception in thread "main" o

Re: JVM Error while building spark

2014-08-07 Thread Sean Owen
(-incubator, +user) It's not Spark running out of memory, but SBT, so those env variables have no effect. They're options to Spark at runtime anyway, not compile time, and you're intending to compile I take it. SBT is a memory hog, and Spark is a big build. You will probably need to give it more

Re: Spark 1.0.1 NotSerialized exception (a bit of a head scratcher)

2014-08-07 Thread contractor
As a follow up, I commented out that entire code and I am still getting the exception. It may be related to what you are suggesting, so are there any best practices so that I can audit other parts of the code? Thanks, Mahesh From: Mahesh Padmanabhan

Re: KMeans Input Format

2014-08-07 Thread Sean Owen
It's not running out of memory on the driver though, right? The executors may need more memory, or use more executors. --executor-memory would let you increase from the default of 512MB. On Thu, Aug 7, 2014 at 5:07 PM, Burak Yavuz wrote: > Hi, > > Could you try running spark-shell with the flag

Re: Spark 1.0.1 NotSerialized exception (a bit of a head scratcher)

2014-08-07 Thread contractor
Thanks TD but unfortunately that did not work. From: Tathagata Das Date: Thursday, August 7, 2014 at 10:55 AM To: Mahesh Padmanabhan Cc: "user@spark.apache.org"

Re: How can I implement eigenvalue decomposition in Spark?

2014-08-07 Thread Sean Owen
(-incubator, +user) If your matrix is symmetric (and real I presume), and if my linear algebra isn't too rusty, then its SVD is its eigendecomposition. The SingularValueDecomposition object you get back has U and V, both of which have columns that are the eigenvectors. There are a few SVDs in the

Re: spark streaming actor receiver doesn't play well with kryoserializer

2014-08-07 Thread Tathagata Das
Another possible reason behind this maybe that there are two versions of Akka present in the classpath, which are interfering with each other. This could happen through many scenarios. 1. Launching Spark application with Scala brings in Akka from Scala, which interferes with Spark's Akka 2. Multip

Spark Streaming Workflow Validation

2014-08-07 Thread Dan H.
I wanted to post for validation to understand if there is a more efficient way to achieve my goal. I'm currently performing this flow for two distinct calculations executing in parallel: 1) Sum key/value pairs by using a simple witnessed count (apply 1 in a mapToPair() and then groupByKey()) 2) Su

Re: Spark Streaming- Input from Kafka, output to HBase

2014-08-07 Thread Tathagata Das
For future reference in this thread, a better set of examples than the MetricAggregatorHBase on the JIRA to look at are here https://github.com/tmalaska/SparkOnHBase On Thu, Aug 7, 2014 at 1:41 AM, Khanderao Kand wrote: > I hope this has been resolved, were u connected to right zookeeper? di

Re: spark streaming multiple file output paths

2014-08-07 Thread Tathagata Das
The problem boils down to how to write an RDD in that way. You could use the HDFS Filesystem API to write each partition directly. pairRDD.groupByKey().foreachPartition(iterator => iterator.map { case (key, values) => // Open an output stream to destination file /key/ // Write valu
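A sketch filling in that pseudo-code with the Hadoop FileSystem API; the output layout and value types are illustrative, and retried or speculative tasks would need extra care in a real job:

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}

    pairRDD.groupByKey().foreachPartition { iterator =>
      val fs = FileSystem.get(new Configuration())
      iterator.foreach { case (key, values) =>
        // one output file per key; all values for a key land in a single partition after groupByKey
        val out = fs.create(new Path("/output/" + key + "/part-0"))
        try {
          values.foreach(v => out.write((v.toString + "\n").getBytes("UTF-8")))
        } finally {
          out.close()
        }
      }
    }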

Re: Spark 1.0.1 NotSerialized exception (a bit of a head scratcher)

2014-08-07 Thread Tathagata Das
It could be because of the variable "enableOpStat". Since its defined outside foreachRDD, referring to it inside the rdd.foreach is probably causing the whole streaming context being included in the closure. Scala funkiness. Try this, see if it works. msgCount.join(ddCount).foreachRDD((rdd: RDD[(I
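If the field itself is the culprit, the usual workaround (a sketch using the names from the thread, not necessarily TD's exact rewrite) is to copy it into a local val so the closure captures only the value, not the enclosing class:

    val opStatEnabled = enableOpStat              // local copy of the field
    msgCount.join(ddCount).foreachRDD { rdd =>
      if (opStatEnabled) {
        // ... per-batch work on rdd ...
      }
    }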

Re: How to read a multipart s3 file?

2014-08-07 Thread paul
darkjh wrote > But in my experience, when reading directly from > s3n, spark create only 1 input partition per file, regardless of the file > size. This may lead to some performance problem if you have big files. This is actually not true, Spark uses the underlying hadoop input formats to read the

Re: Save an RDD to a SQL Database

2014-08-07 Thread chutium
Right, Spark acts more like an OLAP system; I believe no one will use Spark as an OLTP system, so there is always some question about how to share data between these two platforms efficiently. More importantly, most enterprise BI tools rely on an RDBMS or at least a JDBC/ODBC interface

Re: Initial job has not accepted any resources

2014-08-07 Thread Marcelo Vanzin
There are two problems that might be happening: - You're requesting more resources than the master has available, so your executors are not starting. Given your explanation this doesn't seem to be the case. - The executors are starting, but are having problems connecting back to the driver. In th

Re: How to read a multipart s3 file?

2014-08-07 Thread Ashish Rangole
Specify a folder instead of a file name for input and output code, as in: Output: s3n://your-bucket-name/your-data-folder Input: (when consuming the above output) s3n://your-bucket-name/your-data-folder/* On May 6, 2014 5:19 PM, "kamatsuoka" wrote: > I have a Spark app that writes out a file,

Re: reduceByKey to get all associated values

2014-08-07 Thread Evan R. Sparks
Specifically, reduceByKey expects a commutative/associative reduce operation, and will automatically do this locally before a shuffle, which means it acts like a "combiner" in MapReduce terms - http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.rdd.PairRDDFunctions On Thu
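A small sketch of the contrast, assuming `pairs: RDD[(String, Int)]` (hypothetical):

    // groupByKey simply collects all values per key
    val grouped = pairs.groupByKey()                                        // RDD[(String, Iterable[Int])]
    // reduceByKey needs a commutative/associative function and combines map-side first
    val appended = pairs.mapValues(v => List(v)).reduceByKey(_ ++ _)       // RDD[(String, List[Int])]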

Initial job has not accepted any resources

2014-08-07 Thread arnaudbriche
Hi, I'm trying a simple thing: create an RDD from a text file (~3GB) located in GlusterFS, which is mounted by all Spark cluster machines, and calling rdd.count(); but Spark never managed to complete the job, giving message like the following: WARN TaskSchedulerImpl: Initial job has not accepted

Re: KMeans Input Format

2014-08-07 Thread Burak Yavuz
Hi, Could you try running spark-shell with the flag --driver-memory 2g or more if you have more RAM available and try again? Thanks, Burak - Original Message - From: "AlexanderRiggers" To: u...@spark.incubator.apache.org Sent: Thursday, August 7, 2014 7:37:40 AM Subject: KMeans Input F

Spark 1.0.1 NotSerialized exception (a bit of a head scratcher)

2014-08-07 Thread contractor
Hello all, I am not sure what is going on – I am getting a NotSerializedException and initially I thought it was due to not registering one of my classes with Kryo but that doesn’t seem to be the case. I am essentially eliminating duplicates in a spark streaming application by using a “window”

JVM Error while building spark

2014-08-07 Thread Rasika Pohankar
Hello, I am trying to build Apache Spark version 1.0.1 on Ubuntu 12.04 LTS. After unzipping the file and running sbt/sbt assembly I get the following error : rasika@rasikap:~/spark-1.0.1$ sbt/sbt package Error occurred during initialization of VM Could not reserve enough space for object heap Err

Re: Save an RDD to a SQL Database

2014-08-07 Thread Nicholas Chammas
On Thu, Aug 7, 2014 at 11:25 AM, Cheng Lian wrote: > Maybe a little off topic, but would you mind to share your motivation of > saving the RDD into an SQL DB? Many possible reasons (Vida, please chime in with yours!): - You have an existing database you want to load new data into so ever

spark streaming multiple file output paths

2014-08-07 Thread Chen Song
In Spark Streaming, is there a way to write output to different paths based on the partition key? The saveAsTextFiles method will write output in the same directory. For example, if the partition key has a hour/day column and I want to separate DStream output into different directories by hour/day

Re: Save an RDD to a SQL Database

2014-08-07 Thread Nicholas Chammas
On Thu, Aug 7, 2014 at 11:08 AM, 诺铁 wrote: > what if network broken in half of the process? should we drop all data in > database and restart from beginning? The best way to deal with this -- which, unfortunately, is not commonly supported -- is with a two-phase commit that can span connection

Re: Save an RDD to a SQL Database

2014-08-07 Thread Cheng Lian
Maybe a little off topic, but would you mind to share your motivation of saving the RDD into an SQL DB? If you’re just trying to do further transformations/queries with SQL for convenience, then you may just use Spark SQL directly within your Spark application without saving them into DB: va
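A sketch of that route with the Spark 1.0-era SQL API; the case class and data are illustrative:

    import org.apache.spark.sql.SQLContext

    case class Record(name: String, value: Double)

    val sqlContext = new SQLContext(sc)
    import sqlContext.createSchemaRDD            // implicit RDD[Product] -> SchemaRDD conversion
    val records = sc.parallelize(Seq(Record("a", 1.0), Record("b", 2.0)))
    records.registerAsTable("records")
    sqlContext.sql("SELECT name, value FROM records WHERE value > 1.0").collect().foreach(println)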

Re: Save an RDD to a SQL Database

2014-08-07 Thread Thomas Nieborowski
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/JdbcRDD.scala On Thu, Aug 7, 2014 at 8:08 AM, 诺铁 wrote: > I haven't seen people write directly to sql database, > mainly because it's difficult to deal with failure, > what if network broken in half of the pro
