Re: API to add/remove containers inside an application

2014-09-05 Thread Praveen Seluka
Mailed our list - will send it to Spark Dev On Fri, Sep 5, 2014 at 11:28 AM, Rajat Gupta rgu...@qubole.com wrote: +1 on this. First step to more automated autoscaling of spark application master... On Fri, Sep 5, 2014 at 12:56 AM, Praveen Seluka psel...@qubole.com wrote: +user On

NotSerializableException: org.apache.spark.sql.hive.api.java.JavaHiveContext

2014-09-05 Thread Bijoy Deb
Hello All, I am trying to query a Hive table using Spark SQL from my java code,but getting the following error: *Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task not serializable: java.io.NotSerializableException:

Re: EC2 - JNI crashes JVM with multi core instances

2014-09-05 Thread Iriasthor
Hi, thanks for the reply. Actually, I think you are correct about the native library not being thread safe. However, from what I understand, different cores should start different processes, these being independent from one another. If I am not mistaken, the thread-safety issue would start being

Re: Serialize input path

2014-09-05 Thread Sean Owen
How about: val range = Range.getRange.toString val notWorking = "path/output_{" + range + "}/*/*" On Fri, Sep 5, 2014 at 3:45 AM, jerryye jerr...@gmail.com wrote: Hi, I have a quick serialization issue. I'm trying to read a date range of input files and I'm getting a serialization issue when using
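
A minimal sketch of that suggestion (sc is assumed to be an active SparkContext; Range.getRange and the path layout come from the original snippet): compute the range string up front on the driver so the job only captures a plain String.

    val range = Range.getRange.toString                // plain String, safe to capture in closures
    val inputPath = "path/output_{" + range + "}/*/*"  // Hadoop glob syntax expands the {...} list
    val lines = sc.textFile(inputPath)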

Re: Serialize input path

2014-09-05 Thread jerryye
Thanks for the response, Sean. As a correction: the code I provided actually ended up working. I tried to reduce my code down but I was being overzealous, and running count actually works. The minimal code that triggers the problem is this: val userProfiles = lines.map(line =>

Re: Spark Streaming into HBase

2014-09-05 Thread Tathagata Das
Hmmm, even i dont understand why TheMain needs to be serializable. It might be cleaner (as in avoid such mysterious closure issues) to actually create a separate sbt/maven project (instead of a shell) and run the streaming application from there. TD On Thu, Sep 4, 2014 at 10:25 AM, kpeng1

PySpark on Yarn a lot of python scripts project

2014-09-05 Thread Oleg Ruchovets
Hi, We are evaluating PySpark and have successfully executed the PySpark examples on Yarn. Next step: we have a python project (a bunch of python scripts using Anaconda packages). Question: what is the way to execute PySpark on Yarn with a lot of python files (~50)?

Re: Multiple spark shell sessions

2014-09-05 Thread Andrew Ash
Hi Dhimant, We also cleaned up these needless warnings on port failover in Spark 1.1 -- see https://issues.apache.org/jira/browse/SPARK-1902 Andrew On Thu, Sep 4, 2014 at 7:38 AM, Dhimant dhimant84.jays...@gmail.com wrote: Thanks Yana, I am able to execute application and command via

Re: Mapping Hadoop Reduce to Spark

2014-09-05 Thread Lukas Nalezenec
Hi, FYI: There is a bug in Java mapPartitions - SPARK-3369 https://issues.apache.org/jira/browse/SPARK-3369. In Java, results from mapPartitions and similar functions must fit in memory. Look at the example below - it returns a List. Lukas On 1.9.2014 00:50, Matei Zaharia wrote: mapPartitions just
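
For comparison, a minimal Scala sketch of the lazy behaviour the Java API lacks here (assuming an existing SparkContext sc): returning an Iterator from mapPartitions streams the results instead of materializing them as a List.

    val doubled = sc.parallelize(1 to 1000000, 8).mapPartitions { iter =>
      iter.map(_ * 2)   // lazy Iterator: elements are produced as the partition is consumed
    }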

Re: Viewing web UI after fact

2014-09-05 Thread Grzegorz Białek
Hi Andrew, thank you very much for your answer. Unfortunately it still doesn't work. I'm using Spark 1.0.0, and I start the history server by running sbin/start-history-server.sh <dir>, although I also set SPARK_HISTORY_OPTS=-Dspark.history.fs.logDirectory in conf/spark-env.sh. I also tried other dir

Recursion

2014-09-05 Thread Deep Pradhan
Hi, Does Spark support recursive calls?

question on replicate() in blockManager.scala

2014-09-05 Thread rapelly kartheek
Hi,

    var cachedPeers: Seq[BlockManagerId] = null
    private def replicate(blockId: String, data: ByteBuffer, level: StorageLevel) {
      val tLevel = StorageLevel(level.useDisk, level.useMemory, level.deserialized, 1)
      if (cachedPeers == null) {
        cachedPeers = master.getPeers(blockManagerId,

Spark that integrates with Kafka 0.7

2014-09-05 Thread Hemanth Yamijala
Hi, Due to some limitations, we are having to stick to Kafka 0.7.x. We would like to use as latest a version of Spark in streaming mode that integrates with Kafka 0.7. The latest version supports only 0.8 it appears. Has anyone solved such a requirement ? Any tips on what can be tried ? FWIW, I

Re: [GraphX] how to set memory configurations to avoid OutOfMemoryError GC overhead limit exceeded

2014-09-05 Thread Yifan LI
Thank you, Ankur! :) But how to assign the storage level to a new vertices RDD that is mapped from an existing vertices RDD, e.g. *val newVertexRDD = graph.collectNeighborIds(EdgeDirection.Out).map{case(id:VertexId, a:Array[VertexId]) => (id, initialHashMap(a))}* the new one will be combined with
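
A possible sketch (reusing graph and initialHashMap from the snippet above, so both are assumptions here): persist the derived RDD explicitly before it is used, so the desired storage level applies to it.

    import org.apache.spark.graphx._
    import org.apache.spark.storage.StorageLevel

    val newVertexRDD = graph.collectNeighborIds(EdgeDirection.Out)
      .map { case (id: VertexId, a: Array[VertexId]) => (id, initialHashMap(a)) }
      .persist(StorageLevel.MEMORY_AND_DISK)   // the chosen level is an example, not a recommendation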

Running spark-shell (or queries) over the network (not from master)

2014-09-05 Thread Ognen Duzlevski
Is this possible? If I have a cluster set up on EC2 and I want to run spark-shell --master <my master IP on EC2>:7077 from my home computer - is this possible at all or am I wasting my time ;)? I am seeing a connection timeout when I try it. Thanks! Ognen

New sbt plugin to deploy jobs to EC2

2014-09-05 Thread Felix Garcia Borrego
As far as I know, in order to deploy and execute jobs on EC2 you need to assemble your project, copy your jar onto the cluster, log in using ssh and submit the job. To avoid having to do this I've been prototyping an sbt plugin(1) that allows you to create and send Spark jobs to an Amazon EC2 cluster

Re: New sbt plugin to deploy jobs to EC2

2014-09-05 Thread andy petrella
\o/ = will test it soon or sooner, gr8 idea btw aℕdy ℙetrella about.me/noootsab http://about.me/noootsab On Fri, Sep 5, 2014 at 12:37 PM, Felix Garcia Borrego fborr...@gilt.com wrote: As far as I know, in order to deploy and execute jobs on EC2 you need to

spark 1.1.0 requested array size exceed vm limits

2014-09-05 Thread marylucy
I am building spark-1.0-rc4 with maven, following http://spark.apache.org/docs/latest/building-with-maven.html. But when running graphx edgeFileList, some tasks failed with error: requested array size exceeds vm limits, error: executor lost. Does anyone know how to fix it?

RE: Web UI

2014-09-05 Thread Ruebenacker, Oliver A
Hello, Thanks for the explanation. So events are stored internally as JSON, but there is no official support for having Spark serve that JSON via HTTP? So if I wanted to write an app that monitors Spark, I would either have to scrape the web UI in HTML or rely on unofficial JSON

Task not serializable

2014-09-05 Thread Sarath Chandra
Hi, I'm trying to migrate a map-reduce program to work with spark. I migrated the program from Java to Scala. The map-reduce program basically loads an HDFS file, and for each line in the file it applies several transformation functions available in various external libraries. When I execute this

Re: Task not serializable

2014-09-05 Thread Akhil Das
You can bring those classes out of the library and make them Serializable (implement Serializable). It is not the right way of doing it, though it solved a few of my similar problems. Thanks Best Regards On Fri, Sep 5, 2014 at 7:36 PM, Sarath Chandra sarathchandra.jos...@algofusiontech.com wrote: Hi,

Re: Task not serializable

2014-09-05 Thread Sarath Chandra
Hi Akhil, I've done this for the classes which are in my scope. But what to do with classes that are out of my scope? For example org.apache.hadoop.io.Text. Also I'm using several 3rd-party libraries like jeval. ~Sarath On Fri, Sep 5, 2014 at 7:40 PM, Akhil Das ak...@sigmoidanalytics.com wrote:
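
One common workaround, shown as a sketch rather than this thread's resolution: create the non-serializable helpers inside mapPartitions, so they are constructed on the executors and never shipped in a closure (lines and the JEval call are assumptions here).

    val transformed = lines.mapPartitions { iter =>
      val evaluator = new net.sourceforge.jeval.Evaluator()   // built once per partition, never serialized
      iter.map(line => evaluator.evaluate(line))
    }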

Spark-cassandra-connector 1.0.0-rc5: java.io.NotSerializableException

2014-09-05 Thread Shing Hing Man
Hi, My version of Spark is 1.0.2. I am trying to use Spark-cassandra-connector to execute a CQL update statement inside a CassandraConnector(conf).withSessionDo block: CassandraConnector(conf).withSessionDo { session => { myRdd.foreach { case (ip,

Re: TimeStamp selection with SparkSQL

2014-09-05 Thread Brad Miller
My approach may be partly influenced by my limited experience with SQL and Hive, but I just converted all my dates to seconds-since-epoch and then selected samples from specific time ranges using integer comparisons. On Thu, Sep 4, 2014 at 6:38 PM, Cheng, Hao hao.ch...@intel.com wrote: There
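
A rough sketch of that approach (field and RDD names are hypothetical): store the timestamp as seconds since the epoch, so time-range selections become plain integer comparisons.

    case class Event(id: String, ts: Long)                        // ts = seconds since epoch
    val events = rawEvents.map(e => Event(e.id, e.millis / 1000)) // rawEvents is hypothetical
    // later: SELECT * FROM events WHERE ts >= 1409875200 AND ts < 1409961600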

replicated rdd storage problem

2014-09-05 Thread rapelly kartheek
Hi, Whenever I replicate an rdd, I find that the rdd gets replicated only in one node. I have a 3 node cluster. I set rdd.persist(StorageLevel.MEMORY_ONLY_2) in my application. The webUI shows that it is replicated twice. But the rdd storage details show that it is replicated only once, and only in

Re: PySpark on Yarn a lot of python scripts project

2014-09-05 Thread Davies Liu
Hi Oleg, In order to simplify the process of packaging and distributing your code, you could deploy a shared storage (such as NFS), put your project in it, and mount it on all the slaves as /projects. In the spark job scripts, you can access your project by putting the path into sys.path, such as:

Re: Spark that integrates with Kafka 0.7

2014-09-05 Thread Hemanth Yamijala
After searching a little bit, I came to know that Spark 0.8 supports kafka-0.7. So, I tried to use it this way: In my pom.xml, I specified a Spark dependency as follows: <dependency> <groupId>org.apache.spark</groupId> <artifactId>spark-streaming_2.9.3</artifactId>

Re: Spark that integrates with Kafka 0.7

2014-09-05 Thread Hemanth Yamijala
After that long mail :-), I think I figured it out. I removed the 'provided' tag in my pom.xml and let the jars be directly included using maven's jar-with-dependencies plugin. Things started working after that. Thanks Hemanth On Fri, Sep 5, 2014 at 9:50 PM, Hemanth Yamijala yhema...@gmail.com

Re: How to list all registered tables in a sql context?

2014-09-05 Thread Jianshi Huang
Err... there's no such feature? Jianshi On Wed, Sep 3, 2014 at 7:03 PM, Jianshi Huang jianshi.hu...@gmail.com wrote: Hi, How can I list all registered tables in a sql context? -- Jianshi Huang LinkedIn: jianshi Twitter: @jshuang Github Blog: http://huangjs.github.com/ --

Re: TimeStamp selection with SparkSQL

2014-09-05 Thread Brad Miller
Preprocessing (after loading the data into HDFS). I started with data in JSON format in text files (stored in HDFS), and then loaded the data into parquet files with a bit of preprocessing and now I always retrieve the data by creating a SchemaRDD from the parquet file and using the SchemaRDD to
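
A sketch of that pipeline with the Spark SQL 1.0/1.1 API (paths are placeholders):

    import org.apache.spark.sql.SQLContext
    val sqlContext = new SQLContext(sc)
    val raw = sqlContext.jsonFile("hdfs:///data/raw/events")       // schema inferred from JSON text files
    raw.saveAsParquetFile("hdfs:///data/parquet/events")           // one-time preprocessing step
    val events = sqlContext.parquetFile("hdfs:///data/parquet/events")
    events.registerTempTable("events")                             // registerAsTable on Spark 1.0.x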

Re: PySpark on Yarn a lot of python scripts project

2014-09-05 Thread Dimension Data, LLC.
Hi: Curious... is there any reason not to use one of the below pyspark options (in red)? Assuming each file is, say 10k in size, is 50 files too much? Does that touch on some practical limitation?

    Usage: ./bin/pyspark [options]
    Options:
      --master MASTER_URL   spark://host:port,

Re: error: type mismatch while Union

2014-09-05 Thread Yana Kadiyska
Which version are you using -- I can reproduce your issue w/ 0.9.2 but not with 1.0.1...so my guess is that it's a bug and the fix hasn't been backported... No idea on a workaround though.. On Fri, Sep 5, 2014 at 7:58 AM, Dhimant dhimant84.jays...@gmail.com wrote: Hi, I am getting type

Re: [Spark Streaming] Tracking/solving 'block input not found'

2014-09-05 Thread Tathagata Das
Hey Gerard, Spark Streaming should just queue the processing and not delete the block data. There are reports of this error and I am still unable to reproduce the problem. One workaround you can try is the configuration spark.streaming.unpersist = false. This stops Spark Streaming from cleaning up
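
A sketch of setting that workaround when building the streaming context (the app name and batch interval are placeholders):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf()
      .setAppName("streaming-app")
      .set("spark.streaming.unpersist", "false")   // keep received blocks instead of eagerly cleaning them up
    val ssc = new StreamingContext(conf, Seconds(5))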

Re: PySpark on Yarn a lot of python scripts project

2014-09-05 Thread Oleg Ruchovets
Ok, I didn't explain myself correctly: in the case of Java, when you have a lot of classes, a jar should be used. All the examples for PySpark I found are one py script (Pi, wordcount ...), but in a real environment analytics has more than one py file. My question is how to use PySpark on Yarn for analytics in

spark-streaming-kafka with broadcast variable

2014-09-05 Thread Penny Espinoza
I need to use a broadcast variable inside the Decoder I use for class parameter T in org.apache.spark.streaming.kafka.KafkaUtils.createStream. I am using the override with this signature:

Re: PySpark on Yarn a lot of python scripts project

2014-09-05 Thread Andrew Or
Hi Oleg, We do support serving python files in zips. If you use --py-files, you can provide a comma delimited list of zips instead of python files. This will allow you to automatically add these files to the python path on the executors without you having to manually copy them to every single

Re: PySpark on Yarn a lot of python scripts project

2014-09-05 Thread Marcelo Vanzin
On Fri, Sep 5, 2014 at 10:50 AM, Davies Liu dav...@databricks.com wrote: In daily development, it's common to modify your projects and re-run the jobs. If using zip or egg to package your code, you need to do this every time after modification, I think it will be boring. That's why shell

Re: Viewing web UI after fact

2014-09-05 Thread Andrew Or
Hi Grzegorz, Can you verify that there are APPLICATION_COMPLETE files in the event log directories? E.g. Does file:/tmp/spark-events/app-name-1234567890/APPLICATION_COMPLETE exist? If not, it could be that your application didn't call sc.stop(), so the ApplicationEnd event is not actually logged.

Re: Web UI

2014-09-05 Thread Andrew Or
Sure, you can request it by filing an issue here: https://issues.apache.org/jira/browse/SPARK 2014-09-05 6:50 GMT-07:00 Ruebenacker, Oliver A oliver.ruebenac...@altisource.com: Hello, Thanks for the explanation. So events are stored internally as JSON, but there is no official

Re: spark-streaming-kafka with broadcast variable

2014-09-05 Thread Tathagata Das
I am not sure if there is a good, clean way to do that - broadcast variables are not designed to be used outside spark job closures. You could try a bit of a hacky stuff where you write the serialized variable to a file in HDFS / NFS / distributed file system, and then use a custom decoder class

Re: spark 1.1.0 requested array size exceed vm limits

2014-09-05 Thread Ankur Dave
At 2014-09-05 21:40:51 +0800, marylucy qaz163wsx_...@hotmail.com wrote: But running graphx edgeFileList, some tasks failed with error: requested array size exceeds vm limits. Try passing a higher value for minEdgePartitions when calling GraphLoader.edgeListFile. Ankur
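
A sketch of that call (the path and partition count are placeholders):

    import org.apache.spark.graphx.GraphLoader
    val graph = GraphLoader.edgeListFile(sc, "hdfs:///data/edges.txt",
      minEdgePartitions = 200)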

Re: how to choose right DStream batch interval

2014-09-05 Thread qihong
repost since original msg was marked with This post has NOT been accepted by the mailing list yet. I have some questions regarding DStream batch interval: 1. if it only take 0.5 second to process the batch 99% of time, but 1% of batches need 5 seconds to process (due to some random factor or

Re: pyspark on yarn hdp hortonworks

2014-09-05 Thread Greg Hill
I'm running into a problem getting this working as well. I have spark-submit and spark-shell working fine, but pyspark in interactive mode can't seem to find the lzo jar: java.lang.ClassNotFoundException: Class com.hadoop.compression.lzo.LzoCodec not found This is in

Re: Shared variable in Spark Streaming

2014-09-05 Thread Chris Fregly
good question, soumitra. it's a bit confusing. to break TD's code down a bit: dstream.count() is a transformation operation (returns a new DStream), executes lazily, runs in the cluster on the underlying RDDs that come through in that batch, and returns a new DStream with a single element
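
A small sketch of that distinction (dstream is hypothetical): count() itself only builds a new one-element-per-batch DStream; an output operation is still needed to trigger the work.

    val counts = dstream.count()                 // DStream[Long], one element per batch
    counts.foreachRDD { rdd =>
      println("records in this batch: " + rdd.first())
    }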

Re: PySpark on Yarn a lot of python scripts project

2014-09-05 Thread Davies Liu
Here is a story about how shared storage simplifies all the things: In Douban, we use Moose FS[1] instead of HDFS as the distributed file system; it's POSIX compatible and can be mounted just as NFS. We put all the data and tools and code in it, so we can access them easily on all the machines,

Re: PySpark on Yarn a lot of python scripts project

2014-09-05 Thread Marcelo Vanzin
Hi Davies, On Fri, Sep 5, 2014 at 1:04 PM, Davies Liu dav...@databricks.com wrote: In Douban, we use Moose FS[1] instead of HDFS as the distributed file system, it's POSIX compatible and can be mounted just as NFS. Sure, if you already have the infrastructure in place, it might be worthwhile

Repartition inefficient

2014-09-05 Thread anthonyjschu...@gmail.com
I wonder if anyone has any tips for using repartition? It seems that when you call the repartition method, the entire RDD gets split up, shuffled, and redistributed... This is an extremely heavy task if you have a large hdfs dataset and all you want to do is make sure your RDD is balanced / data
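
A sketch of the usual alternative (bigRdd is hypothetical): when the goal is only to merge many small partitions, coalesce without a shuffle avoids moving every record, at the cost of not being able to split or rebalance skewed partitions.

    val merged = bigRdd.coalesce(100, shuffle = false)     // cheap merge, no full shuffle
    val rebalanced = bigRdd.coalesce(100, shuffle = true)  // equivalent to repartition(100)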

Re: Shared variable in Spark Streaming

2014-09-05 Thread Tathagata Das
Good explanation, Chris :) On Fri, Sep 5, 2014 at 12:42 PM, Chris Fregly ch...@fregly.com wrote: good question, soumitra. it's a bit confusing. to break TD's code down a bit: dstream.count() is a transformation operation (returns a new DStream), executes lazily, runs in the cluster on

Re: Running spark-shell (or queries) over the network (not from master)

2014-09-05 Thread anthonyjschu...@gmail.com
I think that should be possible. Make sure spark is installed on your local machine and is the same version as on the cluster.

Re: Running spark-shell (or queries) over the network (not from master)

2014-09-05 Thread Ognen Duzlevski
On 9/5/2014 3:27 PM, anthonyjschu...@gmail.com wrote: I think that should be possible. Make sure spark is installed on your local machine and is the same version as on the cluster. It is the same version, I can telnet to master:7077 but when I run the spark-shell it times out.

Re: Sparse Matrices support in Spark

2014-09-05 Thread yannis_at
You are right, but unfortunately the size of the matrices that I am trying to craft will exceed the capacities of the machines that I have access to, which is why I need sparse libraries. I wish there were a good library for sparse matrices in Java; I will try to check the Scala one that you mentioned.

Re: Running spark-shell (or queries) over the network (not from master)

2014-09-05 Thread qihong
the command should be spark-shell --master spark://<master ip on EC2>:7077.

Re: Running spark-shell (or queries) over the network (not from master)

2014-09-05 Thread Ognen Duzlevski
That is the command I ran and it still times out. Besides 7077, is there any other port that needs to be open? Thanks! Ognen On 9/5/2014 4:10 PM, qihong wrote: the command should be spark-shell --master spark://<master ip on EC2>:7077.

Re: Running spark-shell (or queries) over the network (not from master)

2014-09-05 Thread qihong
Since you are using your home computer, it's probably not reachable by EC2 from the internet. You can try to set spark.driver.host to your WAN ip, spark.driver.port to a fixed port in SparkConf, and open that port in your home network (port forwarding to the computer you are using). See if that
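
A sketch of that configuration (the address and port values are placeholders):

    import org.apache.spark.SparkConf
    val conf = new SparkConf()
      .setMaster("spark://<master EC2 address>:7077")
      .set("spark.driver.host", "<your WAN IP>")   // must be reachable from the EC2 workers
      .set("spark.driver.port", "51000")           // fixed port, forwarded by the home router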

Re: Running spark-shell (or queries) over the network (not from master)

2014-09-05 Thread Ognen Duzlevski
Ah. So there is some kind of a back and forth going on. Thanks! Ognen On 9/5/2014 5:34 PM, qihong wrote: Since you are using your home computer, so it's probably not reachable by EC2 from internet. You can try to set spark.driver.host to your WAN ip, spark.driver.port to a fixed port in

prepending jars to the driver class path for spark-submit on YARN

2014-09-05 Thread Penny Espinoza
Hey - I’m struggling with some dependency issues with org.apache.httpcomponents httpcore and httpclient when using spark-submit with YARN running Spark 1.0.2 on a Hadoop 2.2 cluster. I’ve seen several posts about this issue, but no resolution. The error message is this: Caused by:

Re: spark 1.1.0 requested array size exceed vm limits

2014-09-05 Thread marylucy
I set 200; it still failed in the second step (map and mapPartitions in the webui). In the spark 1.0.2 stable version it works well in the first step, with the same configuration as 1.1.0.

Re: Huge matrix

2014-09-05 Thread Debasish Das
Hi Reza, Have you compared with the brute force algorithm for similarity computation with something like the following in Spark? https://github.com/echen/scaldingale I am adding cosine similarity computation, but I do want to compute all-pair similarities... Note that the data is sparse for

Re: Huge matrix

2014-09-05 Thread Reza Zadeh
Hi Deb, We are adding all-pairs and thresholded all-pairs via dimsum in this PR: https://github.com/apache/spark/pull/1778 Your question wasn't entirely clear - does this answer it? Best, Reza On Fri, Sep 5, 2014 at 6:14 PM, Debasish Das debasish.da...@gmail.com wrote: Hi Reza, Have you

Re: Huge matrix

2014-09-05 Thread Debasish Das
Ohh cool... all-pairs brute force is also part of this PR? Let me pull it in and test on our dataset... Thanks. Deb On Fri, Sep 5, 2014 at 7:40 PM, Reza Zadeh r...@databricks.com wrote: Hi Deb, We are adding all-pairs and thresholded all-pairs via dimsum in this PR:

Re: Huge matrix

2014-09-05 Thread Reza Zadeh
You might want to wait until Wednesday since the interface will be changing in that PR before Wednesday, probably over the weekend, so that you don't have to redo your code. Your call if you need it before a week. Reza On Fri, Sep 5, 2014 at 7:43 PM, Debasish Das debasish.da...@gmail.com wrote:

Re: Huge matrix

2014-09-05 Thread Debasish Das
Ok...just to make sure I have RowMatrix[SparseVector] where rows are ~ 60M and columns are 10M say with billion data points... I have another version that's around 60M and ~ 10K... I guess for the second one both all pair and dimsum will run fine... But for tall and wide, what do you suggest ?

Re: Huge matrix

2014-09-05 Thread Debasish Das
Also for tall and wide (rows ~60M, columns 10M), I am considering running a matrix factorization to reduce the dimension to say ~60M x 50 and then run all pair similarity... Did you also try similar ideas and saw positive results ? On Fri, Sep 5, 2014 at 7:54 PM, Debasish Das

Re: Huge matrix

2014-09-05 Thread Reza Zadeh
For 60M x 10K brute force and dimsum thresholding should be fine. For 60M x 10M probably brute force won't work depending on the cluster's power, and dimsum thresholding should work with appropriate threshold. Dimensionality reduction should help, and how effective it is will depend on your

Re: Huge matrix

2014-09-05 Thread Debasish Das
I looked at the code: similarColumns(Double.posInf) is generating the brute force... Basically, will dimsum with gamma as PositiveInfinity produce the exact same result as doing cartesian products of RDD[(product, vector)] and computing similarities, or will there be some approximation? Sorry I
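
For reference, a sketch of that brute-force baseline (sparse columns keyed by id; all names and the sample data are placeholders). It is O(n^2) in the number of column pairs, which is exactly the cost DIMSUM's sampling is meant to avoid.

    def cosine(a: Map[Long, Double], b: Map[Long, Double]): Double = {
      val dot = a.map { case (k, v) => v * b.getOrElse(k, 0.0) }.sum
      val norm = (m: Map[Long, Double]) => math.sqrt(m.values.map(v => v * v).sum)
      val d = norm(a) * norm(b)
      if (d == 0) 0.0 else dot / d
    }

    val columns = sc.parallelize(Seq(                        // RDD[(Long, Map[Long, Double])]
      1L -> Map(0L -> 1.0, 3L -> 2.0), 2L -> Map(0L -> 1.0)))
    val sims = columns.cartesian(columns)
      .filter { case ((i, _), (j, _)) => i < j }             // each unordered pair once
      .map { case ((i, a), (j, b)) => ((i, j), cosine(a, b)) }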

Re: Huge matrix

2014-09-05 Thread Reza Zadeh
Yes you're right, calling dimsum with gamma as PositiveInfinity turns it into the usual brute force algorithm for cosine similarity, there is no sampling. This is by design. On Fri, Sep 5, 2014 at 8:20 PM, Debasish Das debasish.da...@gmail.com wrote: I looked at the code:

Re: Huge matrix

2014-09-05 Thread Debasish Das
Awesome...Let me try it out... Any plans of putting other similarity measures in future (jaccard is something that will be useful) ? I guess it makes sense to add some similarity measures in mllib... On Fri, Sep 5, 2014 at 8:55 PM, Reza Zadeh r...@databricks.com wrote: Yes you're right,

Re: Huge matrix

2014-09-05 Thread Reza Zadeh
I will add dice, overlap, and jaccard similarity in a future PR, probably still for 1.2 On Fri, Sep 5, 2014 at 9:15 PM, Debasish Das debasish.da...@gmail.com wrote: Awesome...Let me try it out... Any plans of putting other similarity measures in future (jaccard is something that will be

Array and RDDs

2014-09-05 Thread Deep Pradhan
Hi, I have an input file which consists of src_node dest_node. I have created an RDD consisting of key-value pairs where the key is the node id and the values are the children of that node. Now I want to associate a byte with each node. For that I have created a byte array. Every time I print out the