Re: Not able to update collections

2015-02-24 Thread Vijayasarathy Kannan
I am a beginner to Scala/Spark. Could you please elaborate on how to make an RDD of the results of func() and collect? On Tue, Feb 24, 2015 at 2:27 PM, Sean Owen so...@cloudera.com wrote: They aren't the same 'lst'. One is on your driver. It gets copied to executors when the tasks are executed.

Re: Movie Recommendation tutorial

2015-02-24 Thread Sean Owen
It's something like the average error in rating, but a bit different -- it's the square root of average squared error. But if you think of the ratings as 'stars' you could kind of think of 0.86 as 'generally off by 0.86' stars and that would be somewhat right. Whether that's good depends on what

Re: Not able to update collections

2015-02-24 Thread Sean Owen
Instead of ...foreach { edgesBySrc => { lst ++= func(edgesBySrc) } } try ...flatMap { edgesBySrc => func(edgesBySrc) } or even more succinctly ...flatMap(func). This returns an RDD that basically has the list you are trying to build, I believe. You can collect() to the driver but

Re: Not able to update collections

2015-02-24 Thread Vijayasarathy Kannan
Thanks, but it still doesn't seem to work. Below is my entire code. var mp = scala.collection.mutable.Map[VertexId, Int]() var myRdd = graph.edges.groupBy[VertexId](f).flatMap { edgesBySrc => func(edgesBySrc, a, b) } myRdd.foreach { node => { mp(node) = 1 } }

Re: Movie Recommendation tutorial

2015-02-24 Thread Krishna Sankar
Yep, much better with 0.1. The best model was trained with rank = 12 and lambda = 0.1, and numIter = 20, and its RMSE on the test set is 0.869092 (Spark 1.3.0) Question: What is the intuition behind an RMSE of 0.86 vs 1.3? I know the smaller the better. But is it that much better? And what is a good
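For reference, the RMSE quoted above is just the square root of the mean squared difference between actual and predicted ratings. A minimal sketch of how it might be computed, assuming a hypothetical ratesAndPreds: RDD[(Double, Double)] of (actual rating, predicted rating) pairs as in the MLlib recommendation tutorial:

    import math.sqrt
    // square the per-rating error, average it across the RDD, then take the root
    val rmse = sqrt(ratesAndPreds.map { case (r, p) => (r - p) * (r - p) }.mean())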

Re: Missing shuffle files

2015-02-24 Thread Anders Arpteg
If you are thinking of the YARN memory overhead, then yes, I have increased that as well. However, I'm glad to say that my job finally finished successfully. Besides the timeout and memory settings, performing repartitioning (with shuffling) at the right time seems to be the key to making this large job

Running multiple threads with same Spark Context

2015-02-24 Thread Harika
Hi all, I have been running a simple SQL program on Spark. To test the concurrency, I have created 10 threads inside the program, all threads using same SQLContext object. When I ran the program on my EC2 cluster using spark-submit, only 3 threads were running in parallel. I have repeated the

Re: RDD String foreach println

2015-02-24 Thread Sean Owen
println occurs on the machine where the task executes, which may or may not be the same as your local driver process. collect()-ing brings data back to the driver, so printing there definitely occurs on the driver. On Tue, Feb 24, 2015 at 9:48 AM, patcharee patcharee.thong...@uni.no wrote: Hi,

Re: HiveContext in SparkSQL - concurrency issues

2015-02-24 Thread Harika
Hi Sreeharsha, My data is in HDFS. I am trying to use Spark HiveContext (instead of SQLContext) to fire queries on my data just because HiveContext supports more operations. Sreeharsha wrote: "Change derby to mysql and check once, I too faced the same issue." I am pretty new to Spark and

updateStateByKey and invFunction

2015-02-24 Thread Ashish Sharma
So say I want to calculate the top K users visiting a page in the past 2 hours, updated every 5 mins. So here I want to maintain something like this: Page_01 = {user_01:32, user_02:3, user_03:7...} ... Basically a count of the number of times a user visited a page. Here my key is page name/id and state

Re: updateStateByKey and invFunction

2015-02-24 Thread Arush Kharbanda
You can use a reduceByKeyAndWindow with your specific time window. You can specify the inverse function in reduceByKeyAndWindow. On Tue, Feb 24, 2015 at 1:36 PM, Ashish Sharma ashishonl...@gmail.com wrote: So say I want to calculate top K users visiting a page in the past 2 hours updated every
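A minimal sketch of that suggestion, assuming a hypothetical pageUserVisits: DStream[((String, String), Int)] of ((page, user), 1) records:

    import org.apache.spark.streaming.Minutes
    val windowedCounts = pageUserVisits.reduceByKeyAndWindow(
      (a: Int, b: Int) => a + b,   // add counts for records entering the window
      (a: Int, b: Int) => a - b,   // inverse function: subtract counts for records leaving the window
      Minutes(120),                // window length: 2 hours
      Minutes(5))                  // slide interval: 5 minutes
    // Note: the inverse-function variant requires checkpointing to be enabled.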

Facing error while extending scala class with Product interface to overcome limit of 22 fields in spark-shell

2015-02-24 Thread anu
My issue is posted here on Stack Overflow. What am I doing wrong here? http://stackoverflow.com/questions/28689186/facing-error-while-extending-scala-class-with-product-interface-to-overcome-limi

Re: On app upgrade, restore sliding window data.

2015-02-24 Thread Arush Kharbanda
I think this could be of some help to you. https://issues.apache.org/jira/browse/SPARK-3660 On Tue, Feb 24, 2015 at 2:18 AM, Matus Faro matus.f...@kik.com wrote: Hi, Our application is being designed to operate at all times on a large sliding window (day+) of data. The operations

RDD String foreach println

2015-02-24 Thread patcharee
Hi, I would like to print the content of RDD[String]. I tried 1) linesWithSpark.foreach(println) 2) linesWithSpark.collect().foreach(println) I submitted the job by spark-submit. 1) did not print, but 2) did. But when I used the shell, both 1) and 2) printed. Any ideas why 1) behaves

Re: Getting to proto buff classes in Spark Context

2015-02-24 Thread Sean Owen
I assume this is a difference between your local driver classpath and remote worker classpath. It may not be a question of whether the class is there, but classpath visibility issues. Have you looked into settings like spark.files.userClassPathFirst? On Tue, Feb 24, 2015 at 4:43 AM, necro351 .
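For example, a hedged illustration of passing that experimental setting (the class and jar names are placeholders):

    spark-submit --conf spark.files.userClassPathFirst=true --class com.example.MyJob my-app.jar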

Re: Re: Many Receiver vs. Many threads per Receiver

2015-02-24 Thread bit1...@163.com
Thanks Akhil. Not sure whether the low-level consumer will be officially supported by Spark Streaming. So far, I don't see it mentioned/documented in the Spark Streaming programming guide. bit1...@163.com From: Akhil Das Date: 2015-02-24 16:21 To: bit1...@163.com CC: user Subject: Re: Many

Re: Use case for data in SQL Server

2015-02-24 Thread Cheng Lian
There is a newly introduced JDBC data source in Spark 1.3.0 (not the JdbcRDD in Spark core), which may be useful. However, currently there's no SQL Server specific logic implemented. I'd assume standard SQL queries should work. Cheng On 2/24/15 7:02 PM, Suhel M wrote: Hey, I am trying to
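A rough sketch of using that 1.3 JDBC data source against SQL Server (the URL, table, and driver are placeholders, and the exact option names should be checked against the 1.3 docs):

    val df = sqlContext.load("jdbc", Map(
      "url" -> "jdbc:sqlserver://sqlhost:1433;databaseName=mydb;user=myuser;password=secret",
      "dbtable" -> "dbo.MyTable",
      "driver" -> "com.microsoft.sqlserver.jdbc.SQLServerDriver"))
    df.registerTempTable("mytable")
    sqlContext.sql("SELECT COUNT(*) FROM mytable").show()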

Re: Facing error while extending scala class with Product interface to overcome limit of 22 fields in spark-shell

2015-02-24 Thread Akhil Das
Did you happen to have a look at https://spark.apache.org/docs/latest/sql-programming-guide.html#programmatically-specifying-the-schema Thanks Best Regards On Tue, Feb 24, 2015 at 3:39 PM, anu anamika.guo...@gmail.com wrote: My issue is posted here on stack-overflow. What am I doing wrong

New guide on how to write a Spark job in Clojure

2015-02-24 Thread Christian Betz
Hi all, Maybe some of you are interested: I wrote a new guide on how to start using Spark from Clojure. The tutorial covers * setting up a project, * doing REPL- or Test Driven Development of Spark jobs * Running Spark jobs locally. Just read it on

Spark on EC2

2015-02-24 Thread Deep Pradhan
Hi, I have just signed up for Amazon AWS because I learnt that it provides service for free for the first 12 months. I want to run Spark on EC2 cluster. Will they charge me for this? Thank You

Re: Spark on EC2

2015-02-24 Thread Sean Owen
The free tier includes 750 hours of t2.micro instance time per month. http://aws.amazon.com/free/ That's basically a month of hours, so it's all free if you run one instance only at a time. If you run 4, you'll be able to run your cluster of 4 for about a week free. A t2.micro has 1GB of memory,

Re: Spark on EC2

2015-02-24 Thread Deep Pradhan
Thank You Sean. I was just trying to experiment with the performance of Spark applications with various worker instances (I hope you remember that we discussed the worker instances). I thought it would be a good one to try on EC2. So, it doesn't work out, does it? Thank You On Tue, Feb 24,

Re: Pyspark save Decison Tree Module with joblib/pickle

2015-02-24 Thread Sebastián Ramírez
Great to know, thanks Xiangrui. *Sebastián Ramírez* Diseñador de Algoritmos http://www.senseta.com Tel: (+571) 795 7950 ext: 1012 Cel: (+57) 300 370 77 10 Calle 73 No 7 - 06 Piso 4 Linkedin: co.linkedin.com/in/tiangolo/ Twitter: @tiangolo https://twitter.com/tiangolo

Re: Spark on EC2

2015-02-24 Thread Deep Pradhan
No, I think I am ok with the time it takes. Just that, with the increase in the partitions along with the increase in the number of workers, I want to see the improvement in the performance of an application. I just want to see this happen. Any comments? Thank You On Tue, Feb 24, 2015 at 8:52

Re: Sharing Spark Drivers

2015-02-24 Thread Chip Senkbeil
Hi John, This would be a potential application for the Spark Kernel project ( https://github.com/ibm-et/spark-kernel). The Spark Kernel serves as your driver application, allowing you to feed it snippets of code (or load up entire jars via magics) in Scala to execute against a Spark cluster.

Re: Spark on EC2

2015-02-24 Thread gen tang
Hi, I am sorry that I made a mistake on AWS pricing. You can read Sean Owen's email, which explains better the strategies to run Spark on AWS. For your question: it means that you just download Spark and unzip it. Then run the Spark shell by ./bin/spark-shell or ./bin/pyspark. It is useful to get

Re: updateStateByKey and invFunction

2015-02-24 Thread Ashish Sharma
But how will I specify my state there? On Tue, Feb 24, 2015 at 12:50 AM Arush Kharbanda ar...@sigmoidanalytics.com wrote: You can use a reduceByKeyAndWindow with your specific time window. You can specify the inverse function in reduceByKeyAndWindow. On Tue, Feb 24, 2015 at 1:36 PM, Ashish

Re: Spark on EC2

2015-02-24 Thread Deep Pradhan
Thank You Akhil. Will look into it. It's free, isn't it? I am still a student :) On Tue, Feb 24, 2015 at 9:06 PM, Akhil Das ak...@sigmoidanalytics.com wrote: If you signup for Google Compute Cloud, you will get free $300 credits for 3 months and you can start a pretty good cluster for your

Re: Use case for data in SQL Server

2015-02-24 Thread Denny Lee
Hi Suhel, My team is currently working with a lot of SQL Server databases as one of our many data sources and ultimately we pull the data into HDFS from SQL Server. As we had a lot of SQL databases to hit, we used the jTDS driver and SQOOP to extract the data out of SQL Server and into HDFS
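An illustrative Sqoop invocation along those lines (host, database, table, and paths are placeholders, not the poster's actual command):

    sqoop import \
      --connect "jdbc:jtds:sqlserver://sqlhost:1433/mydb" \
      --driver net.sourceforge.jtds.jdbc.Driver \
      --username myuser --password-file /user/me/.sqlpwd \
      --table MyTable \
      --target-dir /data/sqlserver/MyTable \
      -m 4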

Re: Spark on EC2

2015-02-24 Thread Sean Owen
You can definitely, easily, try a 1-node standalone cluster for free. Just don't be surprised when the CPU capping kicks in within about 5 minutes of any non-trivial computation and suddenly the instance is very s-l-o-w. I would consider just paying the ~$0.07/hour to play with an m3.medium,

Re: Spark on EC2

2015-02-24 Thread Charles Feduke
This should help you understand the cost of running a Spark cluster for a short period of time: http://www.ec2instances.info/ If you run an instance for even 1 second of a single hour you are charged for that complete hour. So before you shut down your miniature cluster make sure you really are

Re: Spark on EC2

2015-02-24 Thread Akhil Das
If you signup for Google Compute Cloud, you will get free $300 credits for 3 months and you can start a pretty good cluster for your testing purposes. :) Thanks Best Regards On Tue, Feb 24, 2015 at 8:25 PM, Deep Pradhan pradhandeep1...@gmail.com wrote: Hi, I have just signed up for Amazon AWS

Re: Spark on EC2

2015-02-24 Thread Akhil Das
Yes it is :) Thanks Best Regards On Tue, Feb 24, 2015 at 9:09 PM, Deep Pradhan pradhandeep1...@gmail.com wrote: Thank You Akhil. Will look into it. It's free, isn't it? I am still a student :) On Tue, Feb 24, 2015 at 9:06 PM, Akhil Das ak...@sigmoidanalytics.com wrote: If you signup for

Re: Spark on EC2

2015-02-24 Thread Deep Pradhan
Kindly bear with my questions as I am new to this. What does "if you run spark on local mode on a ec2 machine" mean? Is it that I launch the Spark cluster from my local machine, i.e., by running the shell script that is there in /spark/ec2? On Tue, Feb 24, 2015 at 8:32 PM, gen tang

Re: Spark on EC2

2015-02-24 Thread gen tang
Hi, As a real Spark cluster needs at least one master and one slave, you need to launch two machines. Therefore the second machine is not free. However, if you run Spark in local mode on an EC2 machine, it is free. The charge of AWS depends on how many and what types of machines you launched,

Re: Spark on EC2

2015-02-24 Thread Deep Pradhan
Thank You All. I think I will look into paying ~$0.07/hr as Sean suggested. On Tue, Feb 24, 2015 at 9:01 PM, gen tang gen.tan...@gmail.com wrote: Hi, I am sorry that I made a mistake on AWS pricing. You can read Sean Owen's email, which explains better the strategies to run Spark on AWS.

Re: Accumulator in SparkUI for streaming

2015-02-24 Thread Petar Zecevic
Interesting. Accumulators are shown on the Web UI if you are using the ordinary SparkContext (Spark 1.2). It just has to be named (and that's what you did). scala> val acc = sc.accumulator(0, "test accumulator") acc: org.apache.spark.Accumulator[Int] = 0 scala> val rdd = sc.parallelize(1 to 1000)

Re: Memory problems when calling pipe()

2015-02-24 Thread Juan Rodríguez Hortalá
Hi, I finally solved the problem by setting spark.yarn.executor.memoryOverhead with the option --conf spark.yarn.executor.memoryOverhead= for spark-submit, as pointed out in http://stackoverflow.com/questions/28404714/yarn-why-doesnt-task-go-out-of-heap-space-but-container-gets-killed and
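For example (the overhead value, in megabytes, and the jar name are illustrative assumptions):

    spark-submit --master yarn-cluster --conf spark.yarn.executor.memoryOverhead=1024 --class com.example.MyJob my-app.jar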

Sharing Spark Drivers

2015-02-24 Thread John Omernik
I have been posting on the Mesos list, as I am looking to see if it's possible or not to share Spark drivers. Obviously, in standalone cluster mode, the Master handles requests, and you can instantiate a new SparkContext against a currently running master. However in Mesos (and perhaps YARN) I

Spark 1.3 dataframe documentation

2015-02-24 Thread poiuytrez
Hello, I have built Spark 1.3. I can successfully use the DataFrame API. However, I am not able to find its API documentation in Python. Do you know when the documentation will be available? Best Regards, poiuytrez

Re: Get filename in Spark Streaming

2015-02-24 Thread Emre Sevinc
Hello Subacini, Until someone more knowledgeable suggests a better, more straightforward, and simpler approach with a working code snippet, I suggest the following workaround / hack: inputStream.foreachRDD(rdd => val myStr = rdd.toDebugString // process myStr string value, e.g. using
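A sketch of that hack; the regex and the processing are illustrative assumptions, the idea is simply to scrape anything that looks like an HDFS path out of the RDD's lineage description:

    inputStream.foreachRDD { rdd =>
      val myStr = rdd.toDebugString
      // pull candidate file paths out of the debug string
      val files = "hdfs://[^\\s\\]]+".r.findAllIn(myStr).toList
      files.foreach(f => println(s"batch contains file: $f"))
    }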

Re: Executor size and checkpoints

2015-02-24 Thread Yana Kadiyska
Tathagata, yes, I was using StreamingContext.getOrCreate. My question is about the design decision here. I was expecting that if I have a streaming application that say crashed, and I wanted to give the executors more memory, I would be able to restart, using the checkpointed RDD but with more

Running out of space (when there's no shortage)

2015-02-24 Thread Joe Wass
I'm running a cluster of 3 Amazon EC2 machines (small number because it's expensive when experiments keep crashing after a day!). Today's crash looks like this (stacktrace at end of message). org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle 0 On my

RE: Union and reduceByKey will trigger shuffle even same partition?

2015-02-24 Thread Shuai Zheng
Hi Imran, I will say your explanation is extremely helpful :) I tested some ideas according to your explanation and it makes perfect sense to me. I modified my code to use cogroup+mapValues instead of union+reduceByKey to preserve the partitioning, which gives me more than a 100% performance gain

Re: How to get yarn logs to display in the spark or yarn history-server?

2015-02-24 Thread Christophe Préaud
Hi Colin, Here is how I have configured my Hadoop cluster to have YARN logs available through both the yarn CLI and the YARN history server (with gzip compression and 10 days retention): 1. Add the following properties in the yarn-site.xml on each node manager and on the resource manager:
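(The property list itself is truncated in this digest; the snippet below shows the standard YARN log-aggregation settings matching the gzip compression and 10-day retention described above, as an assumption rather than the poster's exact configuration.)

    <property>
      <name>yarn.log-aggregation-enable</name>
      <value>true</value>
    </property>
    <property>
      <name>yarn.log-aggregation.retain-seconds</name>
      <value>864000</value> <!-- 10 days -->
    </property>
    <property>
      <name>yarn.nodemanager.log-aggregation.compression-type</name>
      <value>gz</value>
    </property>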

Re: How to get yarn logs to display in the spark or yarn history-server?

2015-02-24 Thread Colin Kincaid Williams
Looks like in my tired state, I didn't mention spark the whole time. However, it might be implied by the application log above. Spark log aggregation appears to be working, since I can run the yarn command above. I do have yarn logging setup for the yarn history server. I was trying to use the

Re: How to get yarn logs to display in the spark or yarn history-server?

2015-02-24 Thread Imran Rashid
the spark history server and the yarn history server are totally independent. Spark knows nothing about yarn logs, and vice versa, so unfortunately there isn't any way to get all the info in one place. On Tue, Feb 24, 2015 at 12:36 PM, Colin Kincaid Williams disc...@uw.edu wrote: Looks like in

Re: How to get yarn logs to display in the spark or yarn history-server?

2015-02-24 Thread Colin Kincaid Williams
So back to my original question. I can see the Spark logs using the example above: yarn logs -applicationId application_1424740955620_0009 This shows yarn log aggregation working. I can see the stdout and stderr in that container information above. Then how can I get this information in a

Re: Broadcast Variable updated from one transformation and used from another

2015-02-24 Thread Yiannis Gkoufas
Sorry for the mistake, I actually have it this way: val myObject = new MyObject(); val myObjectBroadcasted = sc.broadcast(myObject); val rdd1 = sc.textFile("/file1").map(e => { myObjectBroadcasted.value.insert(e._1); (e._1,1) }); rdd.cache.count(); //to make sure it is transformed. val rdd2 =

[SparkSQL] Number of map tasks in SparkSQL

2015-02-24 Thread Yana Kadiyska
Shark used to have shark.map.tasks variable. Is there an equivalent for Spark SQL? We are trying a scenario with heavily partitioned Hive tables. We end up with a UnionRDD with a lot of partitions underneath and hence too many tasks:

RE: Broadcast Variable updated from one transformation and used from another

2015-02-24 Thread Ganelin, Ilya
You're not using the broadcasted variable within your map operations. You're attempting to modify myObject directly, which won't work because you are modifying the serialized copy on the executor. You want to do myObjectBroadcasted.value.insert and myObjectBroadcasted.value.lookup. Sent with

Broadcast Variable updated from one transformation and used from another

2015-02-24 Thread Yiannis Gkoufas
Hi all, I am trying to do the following. val myObject = new MyObject(); val myObjectBroadcasted = sc.broadcast(myObject); val rdd1 = sc.textFile("/file1").map(e => { myObject.insert(e._1); (e._1,1) }); rdd.cache.count(); //to make sure it is transformed. val rdd2 = sc.textFile("/file2").map(e => {

Re: Sharing Spark Drivers

2015-02-24 Thread John Omernik
I am aware of that, but two things are working against me here with spark-kernel. Python is our language, and we are really looking for a supported way to approach this for the enterprise. I like the concept, it just doesn't work for us given our constraints. This does raise an interesting point

Re: Running multiple threads with same Spark Context

2015-02-24 Thread Yana Kadiyska
It's hard to tell. I have not run this on EC2, but this worked for me. The only thing that I can think of is that the scheduling mode is set to FAIR (shown in the UI as Scheduling Mode: FAIR). val pool: ExecutorService = Executors.newFixedThreadPool(poolSize); then in a while loop to get curr_job: pool.execute(new
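A minimal sketch of that pattern, assuming a shared sqlContext and hypothetical query strings:

    import java.util.concurrent.{ExecutorService, Executors}

    val pool: ExecutorService = Executors.newFixedThreadPool(10)
    (1 to 10).foreach { i =>
      pool.execute(new Runnable {
        override def run(): Unit = {
          // every thread submits its job through the same SQLContext
          sqlContext.sql(s"SELECT COUNT(*) FROM table_$i").collect()
        }
      })
    }
    pool.shutdown()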

throughput in the web console?

2015-02-24 Thread Josh J
Hi, I plan to run a parameter search varying the number of cores, epoch, and parallelism. The web console provides a way to archive the previous runs, though is there a way to view the throughput in the console? Rather than logging the throughput separately to the log files and correlating the

Re: Running out of space (when there's no shortage)

2015-02-24 Thread Vladimir Rodionov
Usually it happens on Linux when an application deletes a file without double checking that there are no open FDs (resource leak). In this case, Linux holds all the allocated space and does not release it until the application exits (crashes in your case). You check the file system and everything is normal, you have

Re: Running out of space (when there's no shortage)

2015-02-24 Thread Ted Yu
Here is a tool which may give you some clue: http://file-leak-detector.kohsuke.org/ Cheers On Tue, Feb 24, 2015 at 11:04 AM, Vladimir Rodionov vrodio...@splicemachine.com wrote: Usually it happens in Linux when application deletes file w/o double checking that there are no open FDs (resource

Re: Running out of space (when there's no shortage)

2015-02-24 Thread Yiannis Gkoufas
Hi there, I assume you are using spark 1.2.1 right? I faced the exact same issue and switched to 1.1.1 with the same configuration and it was solved. On 24 Feb 2015 19:22, Ted Yu yuzhih...@gmail.com wrote: Here is a tool which may give you some clue: http://file-leak-detector.kohsuke.org/

Re: Not able to update collections

2015-02-24 Thread Sean Owen
They aren't the same 'lst'. One is on your driver. It gets copied to executors when the tasks are executed. Those copies are updated. But the updates will never reflect in the local copy back in the driver. You may just wish to make an RDD of the results of func() and collect() them back to the

Re: [ML][SQL] Select UserDefinedType attribute in a DataFrame

2015-02-24 Thread Xiangrui Meng
If you make `Image` a case class, then select("image.data") should work. On Tue, Feb 24, 2015 at 3:06 PM, Jaonary Rabarisoa jaon...@gmail.com wrote: Hi all, I have a DataFrame that contains a user defined type. The type is an image with the following attribute class Image(w: Int, h: Int,

SparkStreaming failing with exception Could not compute split, block input

2015-02-24 Thread Mukesh Jha
Hi Experts, My Spark Job is failing with below error. From the logs I can see that input-3-1424842351600 was added at 5:32:32 and was never purged out of memory. Also the available free memory for the executor is *2.1G*. Please help me figure out why executors cannot fetch this input. Txz for

Re: Not able to update collections

2015-02-24 Thread Shixiong Zhu
RDD.foreach runs on the executors. You should use `collect` to fetch the data to the driver. E.g., myRdd.collect().foreach { node => { mp(node) = 1 } } Best Regards, Shixiong Zhu 2015-02-25 4:00 GMT+08:00 Vijayasarathy Kannan kvi...@vt.edu: Thanks, but it still doesn't seem to
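Putting Sean's and Shixiong's suggestions together, a minimal sketch of the corrected pattern (f, func, a and b are the original poster's own functions and values):

    val mp = scala.collection.mutable.Map[VertexId, Int]()
    val myRdd = graph.edges.groupBy[VertexId](f).flatMap(edgesBySrc => func(edgesBySrc, a, b))
    myRdd.collect().foreach { node => mp(node) = 1 }   // collect() first, then update the driver-side map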

Re: Unable to run hive queries inside spark

2015-02-24 Thread kundan kumar
Hi Denny, yes the user has all the rights to HDFS. I am running all the Spark operations with this user, and my hive-site.xml looks like this: <property> <name>hive.metastore.warehouse.dir</name> <value>/user/hive/warehouse</value> <description>location of default database for the

Re: spark streaming: stderr does not roll

2015-02-24 Thread Mukesh Jha
I'm also facing the same issue. I tried the configurations, but it seems the executors' log4j.properties overrides the passed values, so you have to change /etc/spark/conf/log4j.properties. Let me know if any of you have managed to get this fixed programmatically. I am planning to

Re: Spark excludes fastutil dependencies we need

2015-02-24 Thread Ted Yu
bq. depend on missing fastutil classes like Long2LongOpenHashMap Looks like Long2LongOpenHashMap should be added to the shaded jar. Cheers On Tue, Feb 24, 2015 at 7:36 PM, Jim Kleckner j...@cloudphysics.com wrote: Spark includes the clearspring analytics package but intentionally excludes

Help vote for Spark talks at the Hadoop Summit

2015-02-24 Thread Reynold Xin
Hi all, The Hadoop Summit uses community choice voting to decide which talks to feature. It would be great if the community could help vote for Spark talks so that Spark has a good showing at this event. You can make three votes on each track. Below I've listed 3 talks that are important to

Re: Unable to run hive queries inside spark

2015-02-24 Thread Denny Lee
The error message you have is: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. MetaException(message:file:/user/hive/warehouse/src is not a directory or unable to create one) Could you verify that you (the user you are running under) has the rights to create

used cores are less then total no. of core

2015-02-24 Thread Somnath Pandeya
Hi All, I am running a simple word count example on Spark (standalone cluster). In the UI it shows 32 cores available for each worker, but while running the jobs only 5 cores are being used. What should I do to increase the number of cores used, or is it selected based on the jobs? Thanks

Re: Unable to run hive queries inside spark

2015-02-24 Thread Denny Lee
That's all you should need to do. Saying this, I did run into an issue similar to this when I was switching Spark versions which were tied to different default Hive versions (eg Spark 1.3 by default works with Hive 0.13.1). I'm wondering if you may be hitting this issue due to that? On Tue, Feb

Re: Executors dropping all memory stored RDDs?

2015-02-24 Thread Thomas Gerber
I have a strong suspicion that it was caused by a full disk on the executor. I am not sure if the executor was supposed to recover from it that way. I cannot be sure about it; I should have had enough disk space, but I think I had some data skew which could have led some executors to run out

Re: used cores are less then total no. of core

2015-02-24 Thread VISHNU SUBRAMANIAN
Try adding --total-executor-cores 5 , where 5 is the number of cores. Thanks, Vishnu On Wed, Feb 25, 2015 at 11:52 AM, Somnath Pandeya somnath_pand...@infosys.com wrote: Hi All, I am running a simple word count example of spark (standalone cluster) , In the UI it is showing For each
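For example, on a standalone cluster (the master URL, class, and jar are placeholders):

    spark-submit --master spark://master-host:7077 --total-executor-cores 5 --class com.example.WordCount my-app.jar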

Re: used cores are less then total no. of core

2015-02-24 Thread Akhil Das
You can set the following in the conf while creating the SparkContext (if you are not using spark-submit): .set("spark.cores.max", "32") Thanks Best Regards On Wed, Feb 25, 2015 at 11:52 AM, Somnath Pandeya somnath_pand...@infosys.com wrote: Hi All, I am running a simple word count

Re: Cannot access Spark web UI

2015-02-24 Thread Mukesh Jha
My Hadoop version is Hadoop 2.5.0-cdh5.3.0 From the Driver logs [3] I can see that SparkUI started on a specified port, also my YARN app tracking URL[1] points to that port which is in turn getting redirected to the proxy URL[2] which gives me java.net.BindException: Cannot assign requested

Re: Running out of space (when there's no shortage)

2015-02-24 Thread Yiannis Gkoufas
No problem, Joe. There you go https://issues.apache.org/jira/browse/SPARK-5081 And also there is this one https://issues.apache.org/jira/browse/SPARK-5715 which is marked as resolved On 24 February 2015 at 21:51, Joe Wass jw...@crossref.org wrote: Thanks everyone. Yiannis, do you know if

Re: reduceByKey vs countByKey

2015-02-24 Thread Jey Kottalam
Hi Sathish, The current implementation of countByKey uses reduceByKey: https://github.com/apache/spark/blob/v1.2.1/core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala#L332 It seems that countByKey is mostly deprecated: https://issues.apache.org/jira/browse/SPARK-3994 -Jey On Tue,
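A small sketch of the equivalence, for a hypothetical pairs: RDD[(String, Int)]:

    val viaCountByKey = pairs.countByKey()   // Map[String, Long], materialized on the driver
    val viaReduceByKey = pairs.mapValues(_ => 1L).reduceByKey(_ + _).collectAsMap()   // same counts via reduceByKey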

Re: Spark 1.3 dataframe documentation

2015-02-24 Thread Reynold Xin
The official documentation will be posted when 1.3 is released (early March). Right now, you can build the docs yourself by running jekyll build in docs. Alternatively, just look at dataframe.py as Ted pointed out. On Tue, Feb 24, 2015 at 6:56 AM, Ted Yu yuzhih...@gmail.com wrote: Have you

Re: How to start spark-shell with YARN?

2015-02-24 Thread Xi Shen
Hi Sean, I launched the spark-shell on the same machine as I started YARN service. I don't think port will be an issue. I am new to spark. I checked the HDFS web UI and the YARN web UI. But I don't know how to check the AM. Can you help? Thanks, David On Tue, Feb 24, 2015 at 8:37 PM Sean

Re: [ML][SQL] Select UserDefinedType attribute in a DataFrame

2015-02-24 Thread Xiangrui Meng
Btw, the correct syntax for alias should be `df.select($"image.data".as("features"))`. On Tue, Feb 24, 2015 at 3:35 PM, Xiangrui Meng men...@gmail.com wrote: If you make `Image` a case class, then select("image.data") should work. On Tue, Feb 24, 2015 at 3:06 PM, Jaonary Rabarisoa jaon...@gmail.com

Re: New guide on how to write a Spark job in Clojure

2015-02-24 Thread Reynold Xin
Thanks for sharing, Chris. On Tue, Feb 24, 2015 at 4:39 AM, Christian Betz christian.b...@performance-media.de wrote: Hi all, Maybe some of you are interested: I wrote a new guide on how to start using Spark from Clojure. The tutorial covers - setting up a project, - doing REPL-

Fair Scheduler Pools

2015-02-24 Thread pnpritchard
Hi, I am trying to use the fair scheduler pools (http://spark.apache.org/docs/latest/job-scheduling.html#fair-scheduler-pools) to schedule two jobs at the same time. In my simple example, I have configured spark in local mode with 2 cores (local[2]). I have also configured two pools in
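A minimal sketch of the wiring involved (the allocation-file path and pool name are placeholders):

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setMaster("local[2]")
      .setAppName("fair-pools-example")
      .set("spark.scheduler.mode", "FAIR")
      .set("spark.scheduler.allocation.file", "/path/to/fairscheduler.xml")
    val sc = new SparkContext(conf)

    // jobs submitted from this thread go to the named pool
    sc.setLocalProperty("spark.scheduler.pool", "pool1")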

Re: Add PredictionIO to Powered by Spark

2015-02-24 Thread Patrick Wendell
Added - thanks! I trimmed it down a bit to fit our normal description length. On Mon, Jan 5, 2015 at 8:24 AM, Thomas Stone tho...@prediction.io wrote: Please can we add PredictionIO to https://cwiki.apache.org/confluence/display/SPARK/Powered+By+Spark PredictionIO http://prediction.io/

Re: [ML][SQL] Select UserDefinedType attribute in a DataFrame

2015-02-24 Thread Jaonary Rabarisoa
By doing so, I got the following error : Exception in thread main org.apache.spark.sql.AnalysisException: GetField is not valid on fields Seems that it doesn't like image.data expression. On Wed, Feb 25, 2015 at 12:37 AM, Xiangrui Meng men...@gmail.com wrote: Btw, the correct syntax for alias

Re: Spark 1.3 dataframe documentation

2015-02-24 Thread Davies Liu
Another way to see the Python docs: $ export PYTHONPATH=$SPARK_HOME/python $ pydoc pyspark.sql On Tue, Feb 24, 2015 at 2:01 PM, Reynold Xin r...@databricks.com wrote: The official documentation will be posted when 1.3 is released (early March). Right now, you can build the docs yourself by

Re: Can you add Big Industries to the Powered by Spark page?

2015-02-24 Thread Patrick Wendell
I've added it, thanks! On Fri, Feb 20, 2015 at 12:22 AM, Emre Sevinc emre.sev...@gmail.com wrote: Hello, Could you please add Big Industries to the Powered by Spark page at https://cwiki.apache.org/confluence/display/SPARK/Powered+By+Spark ? Company Name: Big Industries URL:

Re: Can you add Big Industries to the Powered by Spark page?

2015-02-24 Thread Emre Sevinc
Hello, Thanks for adding, but the URL seems to have a typo: when I click it, it tries to open http//www.bigindustries.be/ But it should be: http://www.bigindustries.be/ Kind regards, Emre Sevinç On Feb 25, 2015 12:29 AM, Patrick Wendell pwend...@gmail.com wrote:

reduceByKey vs countByKey

2015-02-24 Thread Sathish Kumaran Vairavelu
Hello, quick question. I am trying to understand the difference between reduceByKey and countByKey. Which one gives better performance, reduceByKey or countByKey? While we can perform the same count operation using reduceByKey, why do we need countByKey/countByValue? Thanks Sathish

[ML][SQL] Select UserDefinedType attribute in a DataFrame

2015-02-24 Thread Jaonary Rabarisoa
Hi all, I have a DataFrame that contains a user defined type. The type is an image with the following attributes: class Image(w: Int, h: Int, data: Vector) In my DataFrame, images are stored in a column named image that corresponds to the following case class: case class LabeledImage(label: Int,

Re: How to start spark-shell with YARN?

2015-02-24 Thread Denny Lee
It may have to do with the akka heartbeat interval per SPARK-3923 - https://issues.apache.org/jira/plugins/servlet/mobile#issue/SPARK-3923 ? On Tue, Feb 24, 2015 at 16:40 Xi Shen davidshe...@gmail.com wrote: Hi Sean, I launched the spark-shell on the same machine as I started YARN service. I

Unable to run hive queries inside spark

2015-02-24 Thread kundan kumar
Hi, I have placed my hive-site.xml inside spark/conf and I am trying to execute some Hive queries given in the documentation. Can you please suggest what I am doing wrong here? scala> val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc) hiveContext:

Spark excludes fastutil dependencies we need

2015-02-24 Thread Jim Kleckner
Spark includes the clearspring analytics package but intentionally excludes the dependencies of the fastutil package (see below). Spark includes parquet-column which includes fastutil and relocates it under parquet/ but creates a shaded jar file which is incomplete because it shades out some of

Task not serializable exception

2015-02-24 Thread Kartheek.R
Hi, I run into a Task not Serializable exception with the code below. When I remove the threads and run, it works, but with threads I run into the Task not serializable exception. object SparkKart extends Serializable { def parseVector(line: String): Vector[Double] = { DenseVector(line.split('

Re: Facing error while extending scala class with Product interface to overcome limit of 22 fields in spark-shell

2015-02-24 Thread anamika gupta
Hi Akhil, I guess it skipped my attention. I will definitely give it a try. Though I would still like to know: what is the issue with the way I have created the schema? On Tue, Feb 24, 2015 at 4:35 PM, Akhil Das ak...@sigmoidanalytics.com wrote: Did you happen to have a look at

Re: Is Ubuntu server or desktop better for spark cluster

2015-02-24 Thread Sebastián Ramírez
Check out the FAQ in the link by Deepak Vohra. The main difference is that the desktop installation includes common end-user packages, such as LibreOffice, while the server installation doesn't. But the server includes server packages, such as apache2. Also, the Desktop has a GUI (a graphical

EventLog / Timeline calculation - Optimization

2015-02-24 Thread syepes
Hello, For the past few days I have been trying to process and analyse with Spark a Cassandra eventLog table similar to the one shown here. Basically what I want to calculate is the epoch-time delta between each event type for all the device IDs in the table. Currently it's working as expected, but I