Re: How to close resources shared in executor?

2014-10-16 Thread Fengyun RAO
Thanks, Ted. Util.Connection.close() should be called only once, so it can NOT be in a map function: val result = rdd.map(line => { val table = Util.Connection.getTable(user) ... Util.Connection.close() }) As you mentioned: Calling table.close() is the recommended approach.
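For reference, a minimal sketch of the per-partition variant discussed in this thread; Util.Connection and user are the poster's own helpers, and handleLine is a placeholder for the real per-record logic:

    val result = rdd.mapPartitions { lines =>
      // one table per partition, closed when the partition is done
      val table = Util.Connection.getTable(user)
      try {
        lines.map(line => handleLine(table, line)).toVector.iterator
      } finally {
        table.close()   // close the HTable here; the shared connection is closed elsewhere, once
      }
    }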

Problems with ZooKeeper and key canceled

2014-10-16 Thread Malte
I have a spark cluster on mesos and when I run long-running GraphX processing I receive a lot of the following two errors, and one by one my slaves stop doing any work for the process until it's idle. Any idea what is happening? First type of error message: INFO SendingConnection: Initiating

Re: How to close resources shared in executor?

2014-10-16 Thread Fengyun RAO
I may have misunderstood your point. val result = rdd.map(line => { val table = Util.Connection.getTable(user) ... table.close() }) Did you mean this is enough, and there’s no need to call Util.Connection.close(), or HConnectionManager.deleteAllConnections()? Where is the documentation that

RE: Problem executing Spark via JBoss application

2014-10-16 Thread Mehdi Singer
Indeed it was a problem on the executor side… I have to figure out how to fix it now ;-) Thanks! Mehdi From: Yana Kadiyska [mailto:yana.kadiy...@gmail.com] Sent: Wednesday, 15 October 2014 18:32 To: Mehdi Singer Cc: user@spark.apache.org Subject: Re: Problem executing Spark via JBoss

RE: Problem executing Spark via JBoss application

2014-10-16 Thread Jörn Franke
Do you create the application in the context of the web service call? Then the application may be killed after you return from the web service call. However, we would need to see what you do during the web service call and how you invoke the Spark application. On 16 Oct 2014 08:50, Mehdi Singer

Re: How to add HBase dependencies and conf with spark-submit?

2014-10-16 Thread Fengyun RAO
Thanks, Soumitra Kumar. I didn’t know why you put hbase-protocol.jar in SPARK_CLASSPATH, while adding hbase-protocol.jar, hbase-common.jar, hbase-client.jar, htrace-core.jar in --jars, but it did work. Actually, I put all these four jars in SPARK_CLASSPATH along with the HBase conf directory.

RE: Problem executing Spark via JBoss application

2014-10-16 Thread Mehdi Singer
I solved my problem. It was due to a library version used by Spark (snappy-java) that is apparently not compatible with JBoss... I updated the lib version and it's working now. Jörn, this is what I'm doing in my web service call: - Create the Spark context - Create my JavaJdbcRDD - Count the

Re: distributing Scala Map datatypes to RDD

2014-10-16 Thread Jon Massey
Wow, it really was that easy! The implicit joining works a treat. Many thanks, Jon On 13 October 2014 22:58, Stephen Boesch java...@gmail.com wrote: is the following what you are looking for? scala> sc.parallelize(myMap.map{ case (k,v) => (k,v) }.toSeq) res2:
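Spelled out as a tiny self-contained example, assuming an existing SparkContext named sc:

    val myMap = Map("a" -> 1, "b" -> 2)
    val pairs = sc.parallelize(myMap.toSeq)   // RDD[(String, Int)]
    pairs.collect().foreach(println)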

Re: Spark can't find jars

2014-10-16 Thread Christophe Préaud
Hi, I have created a JIRA (SPARK-3967, https://issues.apache.org/jira/browse/SPARK-3967), can you please confirm that you are hit by the same issue? Thanks, Christophe. On 15/10/2014 09:49, Christophe Préaud wrote: Hi Jimmy, Did you try my patch? The problem on my side was that the

Re: Application failure in yarn-cluster mode

2014-10-16 Thread Christophe Préaud
Hi, I have been able to reproduce this problem in our dev environment, and I am fairly sure now that it is indeed a bug. As a consequence, I have created a JIRA (SPARK-3967, https://issues.apache.org/jira/browse/SPARK-3967) for this issue, which is triggered when yarn.nodemanager.local-dirs (not

Re: spark1.0 principal component analysis

2014-10-16 Thread al123
Hi, I don't think anybody answered this question... fintis wrote How do I match the principal components to the actual features since there is some sorting? Would anybody be able to shed a little light on it since I too am struggling with this? Many thanks!! -- View this message in

spark-default.conf description

2014-10-16 Thread Kuromatsu, Nobuyuki
I'm running Spark 1.1.0 on YARN (Hadoop 2.4.1) and trying to use spark.yarn.appMasterEnv.* to execute some scripts. In spark-default.conf, I set environment variables like this, but this description is redundant. spark.yarn.appMasterEnv.SCRIPT_DIR /home/kuromtsu/spark-1.1.0/scripts

Re: Getting the value from DStream[Int]

2014-10-16 Thread Akhil Das
You can do score.print() to see the values, and if you want to do some operations with these values then you have to do a map on that dstream (score.map(myInt => myInt + 5)). Thanks Best Regards On Thu, Oct 16, 2014 at 5:19 AM, SK skrishna...@gmail.com wrote: Hi, As a result of a reduction
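A small sketch showing both operations, assuming a socket text source that emits one integer per line (host, port and batch interval are placeholders):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val ssc = new StreamingContext(new SparkConf().setAppName("scores"), Seconds(10))
    val score = ssc.socketTextStream("localhost", 9999).map(_.toInt)   // DStream[Int]
    score.print()                  // inspect the first values of every batch
    score.map(_ + 5).print()       // transform the values, then inspect again
    ssc.start()
    ssc.awaitTermination()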

Re: submitted uber-jar not seeing spark-assembly.jar at worker

2014-10-16 Thread Tamas Sandor
Hello Owen, I used maven build to make use of the guava collections package renaming, sbt keeps the old Guava package names intact... Finally it turned out that I have just upgraded to the latest version of spark-cassandra-connector: 1.1.0-alpha3 and when I step back to 1.1.0-alpha2 everything

GraphX Performance

2014-10-16 Thread Jianwei Li
Hi, I am writing to ask whether there is any performance data on GraphX. I run 4 workers in AWS (c3.xlarge), 4g memory per executor, 85,331,846 edges from http://socialcomputing.asu.edu/pages/datasets. For the PageRank algorithm, the job can not be

Re: How to write data into Hive partitioned Parquet table?

2014-10-16 Thread Michael Armbrust
Support for dynamic partitioning is available in master and will be part of Spark 1.2 On Thu, Oct 16, 2014 at 1:08 AM, Banias H banias4sp...@gmail.com wrote: I got tipped by an expert that the error of Unsupported language features in query that I had was due to the fact that SparkSQL does not

Unit testing: Mocking out Spark classes

2014-10-16 Thread Saket Kumar
Hello all, I am trying to unit test the classes involved in my Spark job. I am trying to mock out the Spark classes (like SparkContext and Broadcast) so that I can unit test my classes in isolation. However I have realised that these are classes instead of traits. My first question is why? It is

Re: SparkSQL: set hive.metastore.warehouse.dir in CLI doesn't work

2014-10-16 Thread Cheng Lian
The warehouse location needs to be specified before the HiveContext initialization; you can set it via: ./bin/spark-sql --hiveconf hive.metastore.warehouse.dir=/home/spark/hive/warehouse On 10/15/14 8:55 PM, Hao Ren wrote: Hi, The following query in sparkSQL 1.1.0 CLI doesn't work.

Re: YARN deployment of Spark and Thrift JDBC server

2014-10-16 Thread Cheng Lian
On 10/16/14 12:44 PM, neeraj wrote: I would like to reiterate that I don't have Hive installed on the Hadoop cluster. I have some queries on following comment from Cheng Lian-2: The Thrift server is used to interact with existing Hive data, and thus needs Hive Metastore to access Hive catalog.

Re: [SparkSQL] Convert JavaSchemaRDD to SchemaRDD

2014-10-16 Thread Cheng Lian
Why do you need to convert a JavaSchemaRDD to SchemaRDD? Are you trying to use some API that doesn't exist in JavaSchemaRDD? On 10/15/14 5:50 PM, Earthson wrote: I don't know why the JavaSchemaRDD.baseSchemaRDD is private[sql]. And I found that DataTypeConversions is protected[sql]. Finally I

Re: SparkSQL IndexOutOfBoundsException when reading from Parquet

2014-10-16 Thread Yin Huai
Hello Terry, I guess you hit this bug https://issues.apache.org/jira/browse/SPARK-3559. The list of needed column ids was messed up. Can you try the master branch or apply the code change https://github.com/apache/spark/commit/e10d71e7e58bf2ec0f1942cb2f0602396ab866b4 to your 1.1 and see if the

Re: Play framework

2014-10-16 Thread Daniel Siegmann
We execute Spark jobs from a Play application but we don't use spark-submit. I don't know if you really want to use spark-submit, but if not you can just create a SparkContext programmatically in your app. In development I typically run Spark locally. Creating the Spark context is pretty trivial:
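For example, a context created programmatically for local development might look roughly like this (master URL and app name are placeholders):

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setMaster("local[*]")        // run locally during development, as described above
      .setAppName("my-play-app")
    val sc = new SparkContext(conf)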

Re: Unit testing: Mocking out Spark classes

2014-10-16 Thread Daniel Siegmann
Mocking these things is difficult; executing your unit tests in a local Spark context is preferred, as recommended in the programming guide http://spark.apache.org/docs/latest/programming-guide.html#unit-testing. I know this may not technically be a unit test, but it is hopefully close enough.
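A bare-bones sketch of that approach with ScalaTest (suite and test names are made up):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.scalatest.{BeforeAndAfterAll, FunSuite}

    class WordCountSuite extends FunSuite with BeforeAndAfterAll {
      private var sc: SparkContext = _

      override def beforeAll(): Unit =
        sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("test"))

      override def afterAll(): Unit = sc.stop()

      test("reduceByKey sums counts") {
        val counts = sc.parallelize(Seq("a", "b", "a"))
          .map((_, 1)).reduceByKey(_ + _).collectAsMap()
        assert(counts("a") === 2)
      }
    }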

Help required on exercise Data Exploration using Spark SQL

2014-10-16 Thread neeraj
Hi, I'm exploring an exercise, Data Exploration using Spark SQL, from Spark Summit 2014. While running the command val wikiData = sqlContext.parquetFile("data/wiki_parquet").. I'm getting the following output which doesn't match the expected output. Output I'm getting: val wikiData1 =

Re: Larger heap leads to perf degradation due to GC

2014-10-16 Thread Akshat Aranya
I just want to pitch in and say that I ran into the same problem with running with 64GB executors. For example, some of the tasks take 5 minutes to execute, out of which 4 minutes are spent in GC. I'll try out smaller executors. On Mon, Oct 6, 2014 at 6:35 PM, Otis Gospodnetic

Spark SQL DDL, DML commands

2014-10-16 Thread neeraj
Hi, Does Spark SQL have DDL, DML commands to be executed directly? If yes, please share the link. If no, please help me understand why it is not there. Regards, Neeraj -- View this message in context:

Re: How to close resources shared in executor?

2014-10-16 Thread Ted Yu
Which hbase release are you using? Let me refer to the 0.94 hbase code. Take a look at the following method in src/main/java/org/apache/hadoop/hbase/client/HTable.java : public void close() throws IOException { ... if (cleanupConnectionOnClose) { if (this.connection != null) {

Re: YARN deployment of Spark and Thrift JDBC server

2014-10-16 Thread neeraj
1. I'm trying to use Spark SQL as a data source.. is it possible? 2. Please share the link of the ODBC/JDBC drivers at Databricks.. I'm not able to find the same. -- View this message in context:

TaskNotSerializableException when running through Spark shell

2014-10-16 Thread Akshat Aranya
Hi, Can anyone explain how things get captured in a closure when running through the REPL? For example: def foo(..) = { .. } rdd.map(foo) sometimes complains about classes not being serializable that are completely unrelated to foo. This happens even when I write it such: object Foo { def

Re: YARN deployment of Spark and Thrift JDBC server

2014-10-16 Thread Cheng Lian
On 10/16/14 10:48 PM, neeraj wrote: 1. I'm trying to use Spark SQL as data source.. is it possible? Unfortunately Spark SQL ODBC/JDBC support are based on the Thrift server, so at least you need HDFS and a working Hive Metastore instance (used to persist catalogs) to make things work. 2.

Re: Help required on exercise Data Exploration using Spark SQL

2014-10-16 Thread Cheng Lian
Hi Neeraj, The Spark Summit 2014 tutorial uses Spark 1.0. I guess you're using Spark 1.1? Parquet support got polished quite a bit since then, and changed the string representation of the query plan, but this output should be OK :) Cheng On 10/16/14 10:45 PM, neeraj wrote: Hi, I'm

Folding an RDD in order

2014-10-16 Thread Michael Misiewicz
Hi, I'm working on a problem where I'd like to sum items in an RDD in order (approximately). I am currently trying to implement this using a fold, but I'm having some issues because the sorting key of my data is not the same as the folding key for my data. I have data that looks like this:

PySpark Error on Windows with sc.wholeTextFiles

2014-10-16 Thread Griffiths, Michael (NYC-RPM)
Hi, I'm running into an error on Windows (x64, 8.1) running Spark 1.1.0 (pre-built for Hadoop 2.4: spark-1.1.0-bin-hadoop2.4.tgz, http://d3kbcqa49mib13.cloudfront.net/spark-1.1.0-bin-hadoop2.4.tgz) with Java SE Version 8 Update 20 (build 1.8.0_20-b26); just getting started with Spark. When

Re: Spark SQL DDL, DML commands

2014-10-16 Thread Yi Tian
What is your meaning of "executed directly"? Best Regards, Yi Tian tianyi.asiai...@gmail.com On Oct 16, 2014, at 22:50, neeraj neeraj_gar...@infosys.com wrote: Hi, Does Spark SQL have DDL, DML commands to be executed directly? If yes, please share the link. If No, please help me

Running an action inside a loop across multiple RDDs + java.io.NotSerializableException

2014-10-16 Thread _soumya_
Hi, my programming model requires me to generate multiple RDDs for various datasets across a single run and then run an action on it - E.g. MyFunc myFunc = ... //It implements VoidFunction //set some extra variables - all serializable ... for (JavaRDD<String> rdd: rddList) { ...

Standalone Apps and ClassNotFound

2014-10-16 Thread Ashic Mahtab
I'm relatively new to Spark and have got a couple of questions: * I've got an IntelliJ SBT project that's using Spark Streaming with a custom RabbitMQ receiver in the same project. When I run it against local[2], all's well. When I put in spark://masterip:7077, I get a ClassNotFoundException

Re: Running an action inside a loop across multiple RDDs + java.io.NotSerializableException

2014-10-16 Thread _soumya_
Excuse me - the line inside the loop should read: rdd.foreach(myFunc) - not sc. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Running-an-action-inside-a-loop-across-multiple-RDDs-java-io-NotSerializableException-tp16580p16581.html Sent from the Apache

Re: Folding an RDD in order

2014-10-16 Thread Cheng Lian
Hi Michael, I'm not sure I fully understood your question, but I think RDD.aggregate can be helpful in your case. You can see it as a more general version of fold. Cheng On 10/16/14 11:15 PM, Michael Misiewicz wrote: Hi, I'm working on a problem where I'd like to sum items in an RDD in

Re: Spark SQL DDL, DML commands

2014-10-16 Thread Cheng Lian
I guess you're referring to the simple SQL dialect recognized by the SqlParser component. Spark SQL supports most DDL and DML of Hive. But the simple SQL dialect is still very limited. Usually it's used together with some Spark application written in Java/Scala/Python. Within a Spark

Re: Running an action inside a loop across multiple RDDs + java.io.NotSerializableException

2014-10-16 Thread Cheng Lian
You can first union them into a single RDD and then call foreach. In Scala: rddList.reduce(_.union(_)).foreach(myFunc) For the serialization issue, I don’t have any clue unless more code can be shared. On 10/16/14 11:39 PM, soumya wrote: Hi, my programming model requires me to
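An equivalent form that avoids building a deeply nested union chain, assuming rddList is a Seq[RDD[String]] and sc is the SparkContext:

    sc.union(rddList).foreach(myFunc)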

Re: How to add HBase dependencies and conf with spark-submit?

2014-10-16 Thread Soumitra Kumar
Great, it worked. I don't have an answer as to what is special about SPARK_CLASSPATH vs --jars; I just found the working setting through trial and error. - Original Message - From: Fengyun RAO raofeng...@gmail.com To: Soumitra Kumar kumar.soumi...@gmail.com Cc: user@spark.apache.org,

Re: Folding an RDD in order

2014-10-16 Thread Michael Misiewicz
Thanks for the suggestion! That does look really helpful, I see what you mean about it being more general than fold. I think I will replace my fold with aggregate - it should give me more control over the process. I think the problem will still exist though - which is that I can't get the correct

Re: Folding an RDD in order

2014-10-16 Thread Michael Misiewicz
I note that one of the listed variants of aggregateByKey accepts a partitioner as an argument: def aggregateByKey[U](zeroValue: U, partitioner: Partitioner)(seqOp: (U, V) ⇒ U, combOp: (U, U) ⇒ U)(implicit arg0: ClassTag[U]): RDD[(K, U)] Would it be possible to extract my sorted parent's
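For illustration, passing an explicit partitioner to aggregateByKey looks like this; here a HashPartitioner stands in for the sorted parent's partitioner:

    import org.apache.spark.HashPartitioner

    val pairs = sc.parallelize(Seq((1L, 2.0), (1L, 3.0), (2L, 5.0)))
    val part = pairs.partitioner.getOrElse(new HashPartitioner(4))   // stand-in for parent.partitioner.get
    val sums = pairs.aggregateByKey(0.0, part)(_ + _, _ + _)         // RDD[(Long, Double)]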

Re: PySpark Error on Windows with sc.wholeTextFiles

2014-10-16 Thread Davies Liu
It's a bug, could you file a JIRA for this? Thanks! Davies On Thu, Oct 16, 2014 at 8:28 AM, Griffiths, Michael (NYC-RPM) michael.griffi...@reprisemedia.com wrote: Hi, I’m running into an error on Windows (x64, 8.1) running Spark 1.1.0 (pre-built for Hadoop 2.4:

Re: Folding an RDD in order

2014-10-16 Thread Cheng Lian
RDD.aggregate doesn’t require the RDD elements to be pairs, so you don’t need user_id to be the key of the RDD. For example, you can use an empty Map as the zero value of the aggregation. The key of the Map is the user_id you extracted from each tuple, and the value is the aggregated
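A rough sketch of that idea, assuming for illustration that each element is a (user_id, value) pair:

    val events = sc.parallelize(Seq(("u1", 1.0), ("u2", 2.0), ("u1", 3.0)))

    val totals: Map[String, Double] = events.aggregate(Map.empty[String, Double])(
      // fold one element into the per-partition map
      (acc, e) => acc + (e._1 -> (acc.getOrElse(e._1, 0.0) + e._2)),
      // merge the maps produced by different partitions
      (a, b) => b.foldLeft(a) { case (m, (k, v)) => m + (k -> (m.getOrElse(k, 0.0) + v)) }
    )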

Re: Play framework

2014-10-16 Thread Surendranauth Hiraman
Mohammed, Jumping in for Daniel, we actually address the configuration issue by pulling values from environment variables or command line options. Maybe that can handle at least some of your needs. For the akka issue, here is the akka version we include in build.sbt: com.typesafe.akka %%

Re: Spark output to s3 extremely slow

2014-10-16 Thread Anny Chen
Hi Rafal, Thanks for the explanation and solution! I need to write maybe 100 GB to s3. I will try your way and see whether it works for me. Thanks again! On Wed, Oct 15, 2014 at 1:44 AM, Rafal Kwasny m...@entropy.be wrote: Hi, How large is the dataset you're saving into S3? Actually saving

ALS implicit error pyspark

2014-10-16 Thread Gen
Hi, I am trying to use the ALS.trainImplicit method in pyspark.mllib.recommendation. However it didn't work. So I tried to use the example in the python API documentation such as: r1 = (1, 1, 1.0) r2 = (1, 2, 2.0) r3 = (2, 1, 2.0) ratings = sc.parallelize([r1, r2, r3]) model =

Spark assembly for YARN/CDH5

2014-10-16 Thread Philip Ogren
Does anyone know if there are Spark assemblies created and available for download that have been built for CDH5 and YARN? Thanks, Philip

Re: Running an action inside a loop across multiple RDDs + java.io.NotSerializableException

2014-10-16 Thread _soumya_
Sorry - I'll furnish some details below. However, union is not an option for the business logic I have. The function will generate a specific file based on a variable passed in as the setter for the function. This variable changes with each RDD. I annotated the log line where the first run

Re: How to make operation like cogrop() , groupbykey() on pair RDD = [ [ ], [ ] , [ ] ]

2014-10-16 Thread Gen
Hi, You just need to add list() in the sorted function. For example, map((lambda (x,y): (x, (list(y[0]), list(y[1], sorted(list(rdd1.cogroup(rdd2).collect( I think you just forgot the list... PS: your post has NOT been accepted by the mailing list yet. Best Gen pm wrote Hi ,

Re: TaskNotSerializableException when running through Spark shell

2014-10-16 Thread Jimmy McErlain
I actually only ran into this issue recently, after we upgraded to Spark 1.1. Within the REPL for Spark 1.0 everything works fine, but within the REPL for 1.1 it does not. FYI I am also only doing simple regex matching functions within an RDD... Now when I am running the same code as an App everything

Re: Play framework

2014-10-16 Thread US Office Admin
We integrated Spark into Play and use SparkSQL extensively on an ec2 spark cluster on Hadoop hdfs 1.2.1 and tachyon 0.4. Step 1: Create a play scala application as usual Step 2. In Build.sbt put all your spark dependencies. What works for us is Play 2.2.3 Scala 2.10.4 Spark 1.1. We have Akka

reverse an rdd

2014-10-16 Thread ll
hello... what is the best way to iterate through an rdd backward (last element first, first element last)? thanks! -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/reverse-an-rdd-tp16602.html Sent from the Apache Spark User List mailing list archive at

Re: Can's create Kafka stream in spark shell

2014-10-16 Thread Gary Zhao
Thanks Akhil. I tried spark-submit and saw the same issue. I double checked the versions and they look ok. Are you seeing any obvious issues? sbt: name := "Simple Project" version := "1.1" scalaVersion := "2.10.4" libraryDependencies ++= Seq( "org.apache.spark" %% "spark-core" % "1.1.0",

Re: Spark assembly for YARN/CDH5

2014-10-16 Thread Sean Owen
https://repository.cloudera.com/artifactory/cloudera-repos/org/apache/spark/spark-assembly_2.10/ ? I'm not sure why the 5.2 + 1.1 final artifacts don't show up there yet though. On Thu, Oct 16, 2014 at 2:12 PM, Philip Ogren philip.og...@oracle.com wrote: Does anyone know if there Spark

Re: Can's create Kafka stream in spark shell

2014-10-16 Thread Akhil Das
Can you try: sbt: name := "Simple Project" version := "1.1" scalaVersion := "2.10.4" libraryDependencies ++= Seq( "org.apache.spark" %% "spark-core" % "1.1.0", "org.apache.spark" %% "spark-streaming" % "1.1.0", "org.apache.spark" %% "spark-streaming-kafka" % "1.1.0" ) Thanks Best Regards On
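With those dependencies in place, creating the stream itself looks roughly like this (ZooKeeper quorum, group id and topic name are placeholders):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka.KafkaUtils

    val ssc = new StreamingContext(new SparkConf().setAppName("kafka-test"), Seconds(5))
    val stream = KafkaUtils.createStream(ssc, "zkhost:2181", "test-group", Map("mytopic" -> 1))
    stream.map(_._2).print()   // the message payload is the second element of each pair
    ssc.start()
    ssc.awaitTermination()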

Re: reverse an rdd

2014-10-16 Thread Sean Owen
Since you're concerned with the particular ordering, you will need to sort your RDD to ensure the ordering you have in mind. Simply reverse the Ordering with Ordering.reverse() and sort by that instead, and then use toLocalIterator() I suppose. Depending on what you're really trying to achieve,
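For example, with a pair RDD the reverse ordering can go through sortByKey; a rough sketch:

    val pairs = sc.parallelize(Seq((3, "c"), (1, "a"), (2, "b")))
    val reversed = pairs.sortByKey(ascending = false)    // largest key first
    reversed.toLocalIterator.foreach(println)            // iterate on the driver, in order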

Re: ALS implicit error pyspark

2014-10-16 Thread Gen
I tried the same data with scala. It works pretty well. It seems that it is the problem of pyspark. In the console, it shows the following logs: Traceback (most recent call last): File "<stdin>", line 1, in <module> File "/root/spark/python/pyspark/mllib/recommendation.py", line 76, in trainImplicit

Re: ALS implicit error pyspark

2014-10-16 Thread Davies Liu
It seems a bug, Could you create a JIRA for it? thanks! Davies On Thu, Oct 16, 2014 at 12:27 PM, Gen gen.tan...@gmail.com wrote: I tried the same data with scala. It works pretty well. It seems that it is the problem of pyspark. In the console, it shows the following logs: Traceback (most

hi all

2014-10-16 Thread Paweł Szulc
Hi, I just wanted to say hi to the Spark community. I'm developing some stuff right now using Spark (we've started very recently). As the API documentation of Spark is really, really good, I'd like to get deeper knowledge of the internal stuff - you know, the goodies. Watching movies from Spark

Re: reverse an rdd

2014-10-16 Thread Paweł Szulc
Just to have this clear, can you answer with a quick yes or no: Does it mean that when I create an RDD from a file and I simply iterate through it like this: sc.textFile("some_text_file.txt").foreach(line => println(line)) then the actual lines might come in a different order than they are in the file?

Re: reverse an rdd

2014-10-16 Thread Paweł Szulc
Nevermind, I've just run the code in the REPL. Indeed if we do not sort, then the order is totally random. Which actually makes sense if you think about it. On Thu, Oct 16, 2014 at 9:58 PM, Paweł Szulc paul.sz...@gmail.com wrote: Just to have this clear, can you answer with a quick yes or no:

How to name a DStream

2014-10-16 Thread Soumitra Kumar
Hello, I am debugging my code to find out what else to cache. Following is a line in log: 14/10/16 12:00:01 INFO TransformedDStream: Persisting RDD 6 for time 141348600 ms to StorageLevel(true, true, false, false, 1) at time 141348600 ms Is there a way to name a DStream? RDD has a

RE: Play framework

2014-10-16 Thread Mohammed Guller
Thanks, Suren and Raju. Raju – if I remember correctly, the Play package command just creates a jar for your app. That jar file will not include other dependencies. So it is not really a full jar as you mentioned below. So how are you passing all the other dependency jars to Spark? Can you share

Re: Exception while reading SendingConnection to ConnectionManagerId

2014-10-16 Thread Jimmy Li
Does anyone know anything re: this error? Thank you! On Wed, Oct 15, 2014 at 3:38 PM, Jimmy Li jimmy...@bluelabs.com wrote: Hi there, I'm running spark on ec2, and am running into an error there that I don't get locally. Here's the error: 11335 [handle-read-write-executor-3] ERROR

Re: TF-IDF in Spark 1.1.0

2014-10-16 Thread Burke Webster
Thanks for the response. Appreciate the help! Burke On Tue, Oct 14, 2014 at 3:00 PM, Xiangrui Meng men...@gmail.com wrote: You cannot recover the document from the TF-IDF vector, because HashingTF is not reversible. You can assign each document a unique ID, and join back the result after

Re: Spark assembly for YARN/CDH5

2014-10-16 Thread Marcelo Vanzin
Hi Philip, The assemblies are part of the CDH distribution. You can get them here: http://www.cloudera.com/content/cloudera/en/downloads/cdh/cdh-5-2-0.html As of Spark 1.1 (and, thus, CDH 5.2), assemblies are not published to maven repositories anymore (you can see commit [1] for details). [1]

Spark Bug? job fails to run when given options on spark-submit (but starts and fails without)

2014-10-16 Thread Michael Campbell
TL;DR - a spark SQL job fails with an OOM (Out of heap space) error. If given --executor-memory values, it won't even start. Even (!) if the values given ARE THE SAME AS THE DEFAULT. Without --executor-memory: 14/10/16 17:14:58 INFO TaskSetManager: Serialized task 1.0:64 as 14710 bytes in 1

Re: Can's create Kafka stream in spark shell

2014-10-16 Thread Gary Zhao
Same error. I saw someone reported the same issue, e.g. http://apache-spark-user-list.1001560.n3.nabble.com/Spark-streaming-kafka-error-td9106.html Should I use sbt assembly? It failed for deduplicate though. [error] (*:assembly) deduplicate: different file contents found in the following:

Re: ALS implicit error pyspark

2014-10-16 Thread Davies Liu
On Thu, Oct 16, 2014 at 9:53 AM, Gen gen.tan...@gmail.com wrote: Hi, I am trying to use ALS.trainImplicit method in the pyspark.mllib.recommendation. However it didn't work. So I tried use the example in the python API documentation such as: /r1 = (1, 1, 1.0) r2 = (1, 2, 2.0) r3 = (2, 1,

Re: ALS implicit error pyspark

2014-10-16 Thread Davies Liu
Could you post the code that have problem with pyspark? thanks! Davies On Thu, Oct 16, 2014 at 12:27 PM, Gen gen.tan...@gmail.com wrote: I tried the same data with scala. It works pretty well. It seems that it is the problem of pyspark. In the console, it shows the following logs: Traceback

EC2 cluster set up and access to HBase in a different cluster

2014-10-16 Thread freedafeng
The plan is to create an EC2 cluster and run the (py) spark on it. Input data is from s3, output data goes to an hbase in a persistent cluster (also EC2). My questions are: 1. I need to install some software packages on all the workers (sudo apt-get install ...). Is there a better way to do this

scala: java.net.BindException?

2014-10-16 Thread ll
hello... does anyone know how to resolve this issue? i'm running this locally on my computer. keep getting this BindException. much appreciated. 14/10/16 17:48:13 WARN component.AbstractLifeCycle: FAILED SelectChannelConnector@0.0.0.0:4040: java.net.BindException: Address already in use

Re: ALS implicit error pyspark

2014-10-16 Thread Davies Liu
I can run the following code against Spark 1.1 sc = SparkContext() r1 = (1, 1, 1.0) r2 = (1, 2, 2.0) r3 = (2, 1, 2.0) ratings = sc.parallelize([r1, r2, r3]) model = ALS.trainImplicit(ratings, 1) Davies On Thu, Oct 16, 2014 at 2:45 PM, Davies Liu dav...@databricks.com wrote: Could you post the

Strange duplicates in data when scaling up

2014-10-16 Thread Jacob Maloney
I have a flatmap function that shouldn't possibly emit duplicates and yet it does. The output of my function is a HashSet so the function itself cannot output duplicates, and yet I see many copies of keys emitted from it (in one case up to 62). The curious thing is I can't get this to happen

Spark streaming on data at rest.

2014-10-16 Thread ameyc
Apologies if this is something very obvious but I've perused the spark streaming guide and this isn't very evident to me still. So I have files with data of the format: timestamp,column1,column2,column3.. etc. and I'd like to use the spark streaming's window operations on them. However from what

Re: EC2 cluster set up and access to HBase in a different cluster

2014-10-16 Thread freedafeng
Maybe I should create a private AMI to use for my question No.1? Assuming I use the default instance type as the base image.. Anyone tried this? -- View this message in context:

Print dependency graph as DOT file

2014-10-16 Thread Soumitra Kumar
Hello, Is there a way to print the dependency graph of complete program or RDD/DStream as a DOT file? It would be very helpful to have such a thing. Thanks, -Soumitra.

Re: Play framework

2014-10-16 Thread Manu Suryavansh
Hi, Below is the link for a simple Play + SparkSQL example - http://blog.knoldus.com/2014/07/14/play-with-spark-building-apache-spark-with-play-framework-part-3/ https://github.com/knoldus/Play-Spark-Scala Manu On Thu, Oct 16, 2014 at 1:00 PM, Mohammed Guller moham...@glassbeam.com wrote:

local class incompatible: stream classdesc serialVersionUID

2014-10-16 Thread Pat Ferrel
I’ve read several discussions of the error here and so have wiped all cluster machines and copied the master’s spark build to the rest of the cluster. I’ve built my job on the master using the correct Spark version as a dependency and even build that version of Spark. I still get the

Join with large data set

2014-10-16 Thread Ankur Srivastava
Hi, I have an rdd which is my application data and is huge. I want to join this with reference data which is also too huge to fit in memory, and thus I do not want to use a Broadcast variable. What other options do I have to perform such joins? I am using Cassandra as my data store, so should I just

Spark Hive Snappy Error

2014-10-16 Thread arthur.hk.c...@gmail.com
Hi, When trying Spark with a Hive table, I got the "java.lang.UnsatisfiedLinkError: org.xerial.snappy.SnappyNative.maxCompressedLength(I)I" error. val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc) sqlContext.sql("select count(1) from q8_national_market_share") sqlContext.sql("select

Re: local class incompatible: stream classdesc serialVersionUID

2014-10-16 Thread Pat Ferrel
Yes, I removed my Spark dir and scp’ed the master’s build to all cluster machines suspecting that problem. My app (Apache Mahout) had Spark 1.0.1 in the POM but changing it to 1.0.2 (the Spark version installed) gave another error. I guess I’ll have to install Spark 1.0.1 or get Mahout to

Re: scala: java.net.BindException?

2014-10-16 Thread Duy Huynh
thanks marcelo. i only instantiated sparkcontext once, at the beginning, in this code. the exception was thrown right at the beginning. i also tried to run other programs, which worked fine previously, but now also got the same error. it looks like it put a global block on creating sparkcontext
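If a previous context really is still holding port 4040, stopping it with sc.stop() or moving the UI to another port via spark.ui.port usually clears this; a sketch of the latter:

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setMaster("local[*]")
      .setAppName("test")
      .set("spark.ui.port", "4041")   // avoid the clash on the default 4040
    val sc = new SparkContext(conf)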

Exception Logging

2014-10-16 Thread Ge, Yao (Y.)
I need help to better trap Exception in the map functions. What is the best way to catch the exception and provide some helpful diagnostic information such as source of the input such as file name (and ideally line number if I am processing a text file)? -Yao

object in an rdd: serializable?

2014-10-16 Thread ll
i got an exception complaining about serializable. the sample code is below... class HelloWorld(val count: Int) { ... ... } object Test extends App { ... val data = sc.parallelize(List(new HelloWorld(1), new HelloWorld(2))) ... } what is the best way to serialize HelloWorld so that

Re: Exception Logging

2014-10-16 Thread Yana Kadiyska
You can put a try/catch block in the map function and log the exception. The only tricky part is that the exception log will be located on the executor machine. Even if you don't do any trapping you should see the exception stacktrace in the executors' stderr log, which is visible through the UI
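A small sketch of that pattern; the file name and parsing logic are made up, and anything printed to stderr ends up in the executor's stderr log visible through the UI:

    val parsed = sc.textFile("input.txt").flatMap { line =>
      try {
        Some(line.toInt)
      } catch {
        case e: NumberFormatException =>
          System.err.println(s"bad record '$line': ${e.getMessage}")   // lands in the executor's stderr
          None
      }
    }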

RE: Play framework

2014-10-16 Thread Mohammed Guller
Manu, I had looked at that example before starting this thread. I was specifically looking for some suggestions on how to run a Play app with the Spark-submit script on a real cluster. Mohammed From: Manu Suryavansh [mailto:suryavanshi.m...@gmail.com] Sent: Thursday, October 16, 2014 3:32 PM

Re: [SparkSQL] Convert JavaSchemaRDD to SchemaRDD

2014-10-16 Thread Earthson
I'm trying to give API interface to Java users. And I need to accept their JavaSchemaRDDs, and convert it to SchemaRDD for Scala users. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/SparkSQL-Convert-JavaSchemaRDD-to-SchemaRDD-tp16482p16641.html Sent from

RE: Spark Hive Snappy Error

2014-10-16 Thread Shao, Saisai
Hi Arthur, I think this is a known issue in Spark, you can check (https://issues.apache.org/jira/browse/SPARK-3958). I'm curious about it: can you always reproduce this issue? Is this issue related to some specific data sets? Would you mind giving me some information about your workload, Spark

how to build spark 1.1.0 to include org.apache.commons.math3 ?

2014-10-16 Thread Henry Hung
Hi All, I try to build spark 1.1.0 using sbt with the command: sbt/sbt -Dhadoop.version=2.2.0 -Pyarn assembly but the resulting spark-assembly-1.1.0-hadoop2.2.0.jar is still missing the apache commons math3 classes. How to add the math3 classes into the package? Best regards, Henry

Re: Play framework

2014-10-16 Thread US Office Admin
The remaining dependencies (Spark libraries) are available to the context from the sparkhome. I have installed spark such that all the slaves have the same sparkhome. Code looks like this. val conf = new SparkConf() .setSparkHome("/home/dev/spark") .setMaster("spark://99.99.99.999:7077")

RE: Play framework

2014-10-16 Thread Mohammed Guller
What about all the play dependencies since the jar created by the ‘Play package’ won’t include the play jar or any of the 100+ jars on which play itself depends? Mohammed From: US Office Admin [mailto:ad...@vectorum.com] Sent: Thursday, October 16, 2014 7:05 PM To: Mohammed Guller;

Re: object in an rdd: serializable?

2014-10-16 Thread Boromir Widas
Making it a case class should work. On Thu, Oct 16, 2014 at 8:30 PM, ll duy.huynh@gmail.com wrote: i got an exception complaining about serializable. the sample code is below... class HelloWorld(val count: Int) { ... ... } object Test extends App { ... val data =
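A sketch of the case-class variant (case classes are Serializable by default); a plain class that extends Serializable would also work. It assumes sc is an existing SparkContext:

    case class HelloWorld(count: Int)

    object Test extends App {
      val data = sc.parallelize(List(HelloWorld(1), HelloWorld(2)))
      println(data.map(_.count).sum())   // 3.0
    }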

Re: Play framework

2014-10-16 Thread Ramaraju Indukuri
In our case, Play libraries are not required to run spark jobs. Hence they are available only on master and play runs as a regular scala application. I can't think of a case where you need play to run on slaves. Raju On Thu, Oct 16, 2014 at 10:21 PM, Mohammed Guller moham...@glassbeam.com

Re: spark1.0 principal component analysis

2014-10-16 Thread Xiangrui Meng
computePrincipalComponents returns a local matrix X, whose columns are the principal components (ordered), while those column vectors are in the same feature space as the input feature vectors. -Xiangrui On Thu, Oct 16, 2014 at 2:39 AM, al123 ant.lay...@hotmail.co.uk wrote: Hi, I don't think
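For concreteness, a small sketch (illustrative data only): the rows of the returned matrix follow the order of the original input features, and column j holds the loadings of the j-th principal component:

    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.linalg.distributed.RowMatrix

    // three observations over three input features
    val rows = sc.parallelize(Seq(
      Vectors.dense(1.0, 2.0, 3.0),
      Vectors.dense(2.0, 1.0, 0.5),
      Vectors.dense(0.5, 3.0, 1.5)))

    val pc = new RowMatrix(rows).computePrincipalComponents(2)   // 3x2 local matrix
    println(pc)   // row i = original feature i, column j = principal component j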

Re: How to close resources shared in executor?

2014-10-16 Thread Fengyun RAO
Thanks, Ted. We use CDH 5.1 and the HBase version is 0.98.1-cdh5.1.0, in which the javadoc of HConnectionManager.java still recommends a shutdown hook. I looked into val table = Util.Connection.getTable(user), and found it didn't invoke public HTable(Configuration conf, final byte[] tableName, final

Re: How to close resources shared in executor?

2014-10-16 Thread Ted Yu
Looking at Apache 0.98 code, you can follow the example in the class javadoc (line 144 of HConnectionManager.java): * HTableInterface table = connection.getTable("table1"); * try { * // Use the table as needed, for a single operation and a single thread * } finally { * table.close(); *

error when maven build spark 1.1.0 with message You have 1 Scalastyle violation

2014-10-16 Thread Henry Hung
Hi All, I'm using windows 8.1 to build spark 1.1.0 using this command: C:\apache-maven-3.0.5\bin\mvn -Pyarn -Phadoop-2.2 -Dhadoop.version=2.2.0 -DskipTests clean package -e Below is the error message: [ERROR] Failed to execute goal org.scalastyle:scalastyle-maven-plugin:0.4.0:check (default)
