These errors should be fixed on master with Sean's PR:
https://github.com/apache/spark/pull/209
The orbit errors are quite possibly due to using https instead of http,
whether or not the SSL cert was bad. Let us know if they go away after
reverting to http.
On Sun, Mar 23, 2014 at 11:48 AM,
Hello,
I have a weird error showing up when I run a job on my Spark cluster.
The version of spark is 0.9 and I have 3+ GB free on the disk when this
error shows up. Any ideas what I should be looking for?
[error] (run-main-0) org.apache.spark.SparkException: Job aborted: Task
167.0:3 failed
On some systems, /tmp is an in-memory tmpfs file system with its own size
limit. It's possible that this limit has been exceeded. You might try
running the df command to check the free space of /tmp (or of the root
filesystem, if /tmp isn't listed separately).
3 GB also seems pretty low for the remaining free space of a disk.
Hey All,
I think the old thread is here:
https://groups.google.com/forum/#!msg/spark-users/gVtOp1xaPdU/Uyy9cQz9H_8J
The method proposed in that thread is to create a utility class for
doing single-pass aggregations. Using Algebird is a pretty good way to
do this and is a bit more flexible since
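For what it's worth, here is a minimal sketch of the single-pass idea using plain
RDD.aggregate (it assumes sc is an existing SparkContext; Algebird would let you
compose the monoids instead of hand-writing the two functions):
// sum, count, min and max computed in a single pass over the data
val nums = sc.parallelize(Seq(3.0, 1.0, 4.0, 1.0, 5.0))
val (sum, count, min, max) = nums.aggregate((0.0, 0L, Double.MaxValue, Double.MinValue))(
  (acc, x) => (acc._1 + x, acc._2 + 1, math.min(acc._3, x), math.max(acc._4, x)),
  (a, b) => (a._1 + b._1, a._2 + b._2, math.min(a._3, b._3), math.max(a._4, b._4))
)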
On 3/23/14, 5:49 PM, Matei Zaharia wrote:
You can set spark.local.dir to put this data somewhere other than /tmp
if /tmp is full. Actually it's recommended to have multiple local
disks and set it to a comma-separated list of directories, one per disk.
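(For reference, a hedged sketch of setting spark.local.dir programmatically via
SparkConf; the master URL and directory paths below are placeholders, not from
this thread:)
import org.apache.spark.{SparkConf, SparkContext}
// one local directory per physical disk
val conf = new SparkConf()
  .setMaster("spark://master:7077")
  .setAppName("my-app")
  .set("spark.local.dir", "/mnt/disk1/spark-tmp,/mnt/disk2/spark-tmp")
val sc = new SparkContext(conf)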
Matei, does the number of tasks/partitions
Hi
Thanks for reporting this. It'll be great if you can check a couple of
things:
1. Are you trying to use this with Hadoop2 by any chance ? There was an
incompatible ASM version bug that we fixed for Hadoop2
https://github.com/amplab-extras/SparkR-pkg/issues/17 and we verified it,
but I just
Hello,
In Spark we can use *newAPIHadoopRDD* to access different distributed
systems like HDFS, HBase, and MongoDB via different InputFormats.
Is it possible to access the *InputSplit* in Spark directly? Spark can
cache data in local memory.
Perform local computation/aggregation on the local
Hey Jeremy, what happens if you pass batchSize=10 as an argument to your
SparkContext? It tries to serialize that many objects together at a time, which
might be too much. By default the batchSize is 1024.
Matei
On Mar 23, 2014, at 10:11 AM, Jeremy Freeman freeman.jer...@gmail.com wrote:
Hi
I don’t see the errors anymore. Thanks Aaron.
On 24-Mar-2014, at 12:52 am, Aaron Davidson ilike...@gmail.com wrote:
These errors should be fixed on master with Sean's PR:
https://github.com/apache/spark/pull/209
The orbit errors are quite possibly due to using https instead of http,
Aaron, thanks for replying. I am very much puzzled as to what is going
on. A job that used to run on the same cluster is failing with this
mysterious message about not having enough disk space when in fact I can
see through watch df -h that the free space is always hovering around
3+GB on the
Bleh, strike that, one of my slaves was at 100% inode utilization on the
file system. It was /tmp/spark* leftovers that apparently did not get
cleaned up properly after failed or interrupted jobs.
Mental note - run a cron job on all slaves and master to clean up
/tmp/spark* regularly.
Thanks
Thanks for bringing this up. 100% inode utilization is an issue I haven't
seen raised before, and it points to another issue which is not on our
current roadmap for state cleanup (cleaning up data which was not fully
cleaned up from a crashed process).
On Sun, Mar 23, 2014 at 7:57 PM, Ognen
I would love to work on this (and other) stuff if I can bother someone
with questions offline or on a dev mailing list.
Ognen
On 3/23/14, 10:04 PM, Aaron Davidson wrote:
Thanks for bringing this up, 100% inode utilization is an issue I
haven't seen raised before and this raises another issue
Hey there fellow Dukes of Data,
How can I tell how many partitions my RDD is split into?
I'm interested in knowing because, from what I gather, having a good number
of partitions is good for performance. If I'm looking to understand how my
pipeline is performing, say for a parallelized write out
It's much simpler: rdd.partitions.size
On Sun, Mar 23, 2014 at 9:24 PM, Nicholas Chammas
nicholas.cham...@gmail.com wrote:
Hey there fellow Dukes of Data,
How can I tell how many partitions my RDD is split into?
I'm interested in knowing because, from what I gather, having a good
number
As Mark said you can actually access this easily. The main issue I've
seen from a performance perspective is people having a bunch of really
small partitions. This will still work but the performance will
improve if you coalesce the partitions using rdd.coalesce().
This can happen for example if
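A minimal sketch of that fix (the partition count is illustrative; it assumes sc
is an existing SparkContext):
// an RDD that ended up with thousands of tiny partitions
val fineGrained = sc.textFile("hdfs:///data/many-small-files/*")
// merge them into a more reasonable number before the expensive stages
val compacted = fineGrained.coalesce(100)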
Ognen - just so I understand. The issue is that there weren't enough
inodes and this was causing a "No space left on device" error? Is that
correct? If so, that's good to know because it's definitely
counterintuitive.
On Sun, Mar 23, 2014 at 8:36 PM, Ognen Duzlevski
og...@nengoiksvelzud.com wrote:
Hi Tsai,
Could you share more information about the machine you used and the
training parameters (runs, k, and iterations)? That will help in tracking
down your issue. Thanks!
Best,
Xiangrui
On Sun, Mar 23, 2014 at 3:15 AM, Tsai Li Ming mailingl...@ltsai.com wrote:
Hi,
At the reduceByKey stage, it takes
Hi,
This is on a 4-node cluster, each node with 32 cores/256GB RAM.
Spark (0.9.0) is deployed in standalone mode.
Each worker is configured with 192GB. Spark executor memory is also 192GB.
This is on the first iteration. K=50. Here’s the code I use:
http://pastebin.com/2yXL3y8i , which is a
Hi,
I have a few questions regarding log file management in Spark:
1. Currently I did not find any way to modify the log file name for
executors/drivers. It is hardcoded as stdout and stderr. Also there is no log
rotation.
In the case of a streaming application this will grow forever and become
Hi All !! I am getting the following error in interactive spark-shell
[0.8.1]
*org.apache.spark.SparkException: Job aborted: Task 0.0:0 failed more than
0 times; aborting job java.lang.OutOfMemoryError: GC overhead limit
exceeded*
But I had set the following in spark-env.sh and
K = 50 is certainly a large number for k-means. If there is no
particular reason to have 50 clusters, could you try to reduce it
to, e.g., 100 or 1000? Also, the example code is not for large-scale
problems. You should use the KMeans implementation in MLlib's clustering
package for your problem.
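For reference, a rough sketch of calling MLlib's KMeans on 0.9-era Spark (the
input path and parameters are placeholders; newer releases take an RDD[Vector]
instead of RDD[Array[Double]]):
import org.apache.spark.mllib.clustering.KMeans
// parse whitespace-separated numeric features; assumes sc is an existing SparkContext
val data = sc.textFile("hdfs:///data/points.txt")
  .map(line => line.split("\\s+").map(_.toDouble))
  .cache()
val model = KMeans.train(data, 50, 20) // k = 50, maxIterations = 20
println("Cost: " + model.computeCost(data))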
Thanks, let me try with a smaller K.
Does the size of the input data matter for the example? Currently I have 50M
rows. What is a reasonable size to demonstrate the capability of Spark?
On 24 Mar, 2014, at 3:38 pm, Xiangrui Meng men...@gmail.com wrote:
K = 50 is certainly a large
To be clear on what your configuration will do:
- SPARK_DAEMON_MEMORY=8g will make your standalone master and worker
schedulers have a lot of memory. These do not impact the actual amount of
useful memory given to executors or your driver, however, so you probably
don't need to set this.
-
PS: you have a typo in DEAMON - it's DAEMON. Thanks, Latin.
On Mar 24, 2014 7:25 AM, Sai Prasanna ansaiprasa...@gmail.com wrote:
Hi All !! I am getting the following error in interactive spark-shell
[0.8.1]
*org.apache.spark.SparkException: Job aborted: Task 0.0:0 failed more
than 0 times;
I am also facing the same problem. I have implemented Serializable for my
code, but the exception is thrown from third-party libraries over which I
have no control.
Exception in thread main org.apache.spark.SparkException: Job aborted:
Task not serializable: java.io.NotSerializableException: (lib
Can someone answer this question please?
Specifically about the Serializable implementation of dependent jars?
Yes my input data is partitioned in a completely random manner, so each
worker that produces shuffle data processes only a part of it. The way I
understand it is that before each stage each worker needs to distribute
correct partitions (based on hash key ranges?) to other workers. And this
is
Hello,
Has anyone got any ideas? I am not quite sure if my problem is an exact fit
for Spark, since in reality in this
section of my program I am not really doing a reduce job, simply a group-by
and partition.
Would calling pipe on the partitioned JavaRDD do the trick? Are there any
examples
Patrick, correct. I have a 16 node cluster. On 14 machines out of 16,
the inode usage was about 50%. On two of the slaves, one had inode usage
of 96% and on the other it was 100%. When I went into /tmp on these two
nodes - there were a bunch of /tmp/spark* subdirectories which I
deleted. This
Oh, glad to know it's that simple!
Patrick, in your last comment did you mean filter in? As in I start with
one year of data and filter it so I have one day left? I'm assuming in that
case the empty partitions would be for all the days that got filtered out.
Nick
On Monday, March 24, 2014, Patrick
We now have a way to work around this.
For classes that can't easily implement Serializable, we wrap the class in a
Scala object.
For example:
class A {} // This class is not serializable,
object AHolder {
private val a: A = new A()
def get: A = a
}
This
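The point of the holder is that only the singleton wrapper is referenced from the
closure, so the non-serializable instance never has to be shipped from the driver.
A hypothetical usage sketch (doSomethingWith stands in for A's real API, and rdd
for an existing RDD):
// A is constructed lazily on each executor JVM the first time AHolder is touched there
val results = rdd.map { record =>
  AHolder.get.doSomethingWith(record)
}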
Dear all,
Sorry for asking such a basic question, but can someone explain when one
should use mapPartitions instead of map?
Thanks
Jaonary
What does this error mean:
@hadoop-s2.oculus.local:45186]: Error [Association failed with
[akka.tcp://spark@hadoop-s2.oculus.local:45186]] [
akka.remote.EndpointAssociationException: Association failed with
[akka.tcp://spark@hadoop-s2.oculus.local:45186]
Caused by:
Another thing I have noticed is that out of my master+15 slaves, two
slaves always carry a higher inode load. So for example right now I am
running an intensive job that takes about an hour to finish and two
slaves have been showing an increase in inode consumption (they are
about 10% above
I've seen two cases most commonly:
The first is when I need to create some processing object to process each
record. If that object creation is expensive, creating one per record
becomes prohibitive. So instead, we use mapPartitions, and create one per
partition, and use it on each record in the
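A minimal sketch of that pattern (HeavyweightParser and lines are illustrative
names, not from the thread; lines is assumed to be an existing RDD[String]):
// stand-in for an object that is expensive to construct
class HeavyweightParser { def parse(s: String): Array[String] = s.split(",") }
val parsed = lines.mapPartitions { iter =>
  val parser = new HeavyweightParser() // created once per partition, not once per record
  iter.map(record => parser.parse(record))
}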
Hi All,
I found out why this problem exists. Consider the following scenario:
- a DStream is created from any source. (I've checked with file and socket)
- No actions are applied to this DStream
- Sliding Window operation is applied to this DStream and an action is applied
to the sliding window.
Mark,
This appears to be a Scala-only feature. :(
Patrick,
Are we planning to add this to PySpark?
Nick
On Mon, Mar 24, 2014 at 12:53 AM, Mark Hamstra m...@clearstorydata.comwrote:
It's much simpler: rdd.partitions.size
On Sun, Mar 23, 2014 at 9:24 PM, Nicholas Chammas
Oh, I also forgot to mention:
I start the master and workers (call ./sbin/start-all.sh), and then start
the shell:
MASTER=spark://localhost:7077 ./bin/spark-shell
Then I get the exceptions...
Thanks
I have a DStream like this:
..RDD[a,b],RDD[b,c]..
Is there a way to remove duplicates across the entire DStream? Ie: I would like
the output to be (by removing one of the b's):
..RDD[a],RDD[b,c].. or ..RDD[a,b],RDD[c]..
Thanks
-Adrian
After a long time (maybe a month) I could do a fresh build for
2.0.0-mr1-cdh4.5.0... I was using the cached files in .ivy2/cache.
My case is especially painful since I have to build behind a firewall...
@Sean thanks for the fix...I think we should put a test for https/firewall
compilation as
Number of rows doesn't matter much as long as you have enough workers
to distribute the work. K-means has complexity O(n * d * k), where n
is number of points, d is the dimension, and k is the number of
clusters. If you use the KMeans implementation from MLlib, the
initialization stage is done on
Hi,
I have a very simple use case:
I have an RDD as follows:
d = [[1,2,3,4],[1,5,2,3],[2,3,4,5]]
Now, I want to remove all the duplicates from a column and return the
remaining frame.
For example:
If I want to remove the duplicates based on column 1,
then basically I would remove either row
1. Not sure on this; I don't believe we change the defaults from Java.
2. SPARK_JAVA_OPTS can be used to set the various Java properties (other
than memory heap size itself)
3. If you want to have 8 GB executors then, yes, only two can run on each
16 GB node. (In fact, you should also keep a
Hello,
I'm interested in extending the comparison between GraphX and GraphLab
presented in Xin et al. (2013). The evaluation presented there is rather
limited as it only compares the frameworks for one algorithm (PageRank) on
a cluster with a fixed number of nodes. Are there any graph algorithms
Niko,
Comparing some other components will be very useful as well: svd++ from
GraphX vs. the same algorithm in GraphLab; also mllib.recommendation.ALS
(implicit/explicit) compared to the collaborative filtering toolkit in
GraphLab...
To stress test, what's the biggest sparse dataset that you
Just found this issue and wanted to link it here, in case somebody finds this
thread later:
https://spark-project.atlassian.net/browse/SPARK-939
On Thursday, March 20, 2014 at 11:14 AM, Matei Zaharia wrote:
Hi Jaka,
I’d recommend rebuilding Spark with a new version of the HTTPClient
Has anyone successfully followed the instructions on the Quick Start page
of the Spark home page to run a standalone Scala application? I can't,
and I figure I must be missing something obvious!
I'm trying to follow the instructions here as close to word for word as
possible:
Hi, Diana,
See my inlined answer
--
Nan Zhu
On Monday, March 24, 2014 at 3:44 PM, Diana Carroll wrote:
Has anyone successfully followed the instructions on the Quick Start page of
the Spark home page to run a standalone Scala application? I can't, and I
figure I must be missing
I am able to run standalone apps. I think you are making one mistake
that throws you off from there onwards. You don't need to put your app
under SPARK_HOME. I would create it in its own folder somewhere, it
follows the rules of any standalone scala program (including the
layout). In the guide,
Hi Niko,
The GraphX team recently wrote a longer paper with more benchmarks and
optimizations: http://arxiv.org/abs/1402.2394
Regarding the performance of GraphX vs. GraphLab, I believe GraphX
currently outperforms GraphLab only in end-to-end benchmarks of pipelines
involving both graph-parallel
Yana: Thanks. Can you give me a transcript of the actual commands you are
running?
Thanks!
Diana
On Mon, Mar 24, 2014 at 3:59 PM, Yana Kadiyska yana.kadiy...@gmail.comwrote:
I am able to run standalone apps. I think you are making one mistake
that throws you off from there onwards. You
I have what I would call unexpected behaviour when using window on a stream.
I have 2 windowed streams with a 5s batch interval. One windowed stream is
(5s,5s)=smallWindow and the other is (10s,5s)=bigWindow.
What I've noticed is that the 1st RDD produced by bigWindow is incorrect and is
of the size
Diana,
Anywhere on the filesystem you have read/write access (you need not be
in your spark home directory):
mkdir myproject
cd myproject
mkdir project
mkdir target
mkdir -p src/main/scala
cp $mypath/$mymysource.scala src/main/scala/
cp $mypath/myproject.sbt .
Make sure that myproject.sbt
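For reference, myproject.sbt for the Quick Start app is roughly the following
(versions match the 0.9.0 release discussed in this thread; adjust to your setup):

name := "Simple Project"

version := "1.0"

scalaVersion := "2.10.3"

libraryDependencies += "org.apache.spark" %% "spark-core" % "0.9.0-incubating"

resolvers += "Akka Repository" at "http://repo.akka.io/releases/"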
Thanks, Nan Zhu.
You say that my problems are because "you are in Spark directory, don't
need to do that actually, the dependency on Spark is resolved by sbt".
I did try it initially in what I thought was a much more typical place,
e.g. ~/mywork/sparktest1. But as I said in my email:
(Just for
Thanks Ognen.
Unfortunately I'm not able to follow your instructions either. In
particular:
sbt compile
sbt run arguments if any
This doesn't work for me because there's no program on my path called
sbt. The instructions in the Quick Start guide are specific that I
should call
There is no direct way to get this in pyspark, but you can get it from the
underlying java rdd. For example
a = sc.parallelize([1,2,3,4], 2)
a._jrdd.splits().size()
On Mon, Mar 24, 2014 at 7:46 AM, Nicholas Chammas
nicholas.cham...@gmail.com wrote:
Mark,
This appears to be a Scala-only
Hi,
Quick question about partitions. If my RDD is partitioned into 5
partitions, does that mean that I am constraining it to exist on at most 5
machines?
Thanks
Yeah, that's exactly what I did. Unfortunately it doesn't work:
$SPARK_HOME/sbt/sbt package
awk: cmd. line:1: fatal: cannot open file `./project/build.properties' for
reading (No such file or directory)
Attempting to fetch sbt
/usr/lib/spark/sbt/sbt: line 33: sbt/sbt-launch-.jar: No such file or
Ah we should just add this directly in pyspark - it's as simple as the
code Shivaram just wrote.
- Patrick
On Mon, Mar 24, 2014 at 1:25 PM, Shivaram Venkataraman
shivaram.venkatara...@gmail.com wrote:
There is no direct way to get this in pyspark, but you can get it from the
underlying java
Ah crud, I guess you are right, I am using the sbt I installed manually
with my Scala installation.
Well, here is what you can do:
mkdir ~/bin
cd ~/bin
wget
http://repo.typesafe.com/typesafe/ivy-releases/org.scala-sbt/sbt-launch/0.13.1/sbt-launch.jar
vi sbt
Put the following contents into
Hi, Diana,
You don't need to use the Spark-distributed sbt;
just download sbt from its official website and set your PATH to the right place.
Best,
--
Nan Zhu
On Monday, March 24, 2014 at 4:30 PM, Diana Carroll wrote:
Yeah, that's exactly what I did. Unfortunately it doesn't work:
For instance, I need to work with an RDD in terms of N parts. Will calling
RDD.coalesce(N) possibly cause processing bottlenecks?
On Mon, Mar 24, 2014 at 1:28 PM, Walrus theCat walrusthe...@gmail.comwrote:
Hi,
Quick question about partitions. If my RDD is partitioned into 5
partitions,
Thanks for your help, everyone. Several folks have explained that I can
surely solve the problem by installing sbt.
But I'm trying to get the instructions working *as written on the Spark
website*. The instructions not only don't have you install sbt
separately...they actually specifically have
RDD.coalesce should be fine for rebalancing data across all RDD partitions.
Coalesce is pretty handy in situations where you have sparse data and want
to compact it (e.g. data after applying a strict filter) OR you know the
magic number of partitions according to your cluster which will be
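A small sketch of both cases (the partition counts are illustrative; events is
assumed to be an existing RDD[String]):
// compact sparse data left over after a strict filter, without a shuffle
val survivors = events.filter(line => line.contains("ERROR")).coalesce(20)
// or force an even rebalance at the cost of a shuffle
val rebalanced = events.coalesce(20, shuffle = true)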
Ognen:
I don't know why your process is hanging, sorry. But I do know that the
way saveAsTextFile works is that you give it a path to a directory, not a
file. The file is saved in multiple parts, corresponding to the
partitions. (part-0, part-1 etc.)
(Presumably it does this because it
Diana, thanks. I am not very well acquainted with HDFS. I use hdfs -put
to put things as files into the filesystem (and sc.textFile to get stuff
out of them in Spark) and I see that they appear to be saved as files
that are replicated across 3 out of the 16 nodes in the hdfs cluster
(which is
Thanks Matei, unfortunately doesn't seem to fix it. I tried batchSize = 10,
100, as well as 1 (which should reproduce the 0.8.1 behavior?), and it stalls
at the same point in each case.
-- Jeremy
-
jeremy freeman, phd
neuroscientist
@thefreemanlab
On Mar 23, 2014, at 9:56
Hello Sanjay,
Yes, your understanding of lazy semantics is correct. But ideally
every batch should read based on the batch interval provided in the
StreamingContext. Can you open a JIRA on this?
On Mon, Mar 24, 2014 at 7:45 AM, Sanjay Awatramani
sanjay_a...@yahoo.com wrote:
Hi All,
I found
Yes, I believe that is current behavior. Essentially, the first few
RDDs will be partial windows (assuming window duration sliding
interval).
TD
On Mon, Mar 24, 2014 at 1:12 PM, Adrian Mocanu
amoc...@verticalscope.com wrote:
I have what I would call unexpected behaviour when using window on a
Syed,
Thanks for the tip. I'm not sure if coalesce is doing what I'm intending
to do, which is, in effect, to subdivide the RDD into N parts (by calling
coalesce and doing operations on the partitions.) It sounds like, however,
this won't bottleneck my processing power. If this sets off any
Just so I can close this thread (in case anyone else runs into this
stuff) - I did sleep through the basics of Spark ;). The answer on why
my job is in waiting state (hanging) is here:
http://spark.incubator.apache.org/docs/latest/spark-standalone.html#resource-scheduling
Ognen
On 3/24/14,
Hi guys,
thanks for the information, I'll give it a try with Algebird,
thanks again,
Richard
@Patrick, thanks for the release calendar
On Mon, Mar 24, 2014 at 12:16 AM, Patrick Wendell pwend...@gmail.comwrote:
Hey All,
I think the old thread is here:
Hi, I have a large data set of numbers, i.e. an RDD, and want to perform a
computation only on groups of two values at a time. For
example, 1,2,3,4,5,6,7... is an RDD. Can I group the RDD into
(1,2),(3,4),(5,6)...? And perform the respective computations in an
efficient manner? As we don't have a way to index
Diana, I think you are correct - I just installed
wget
http://mirror.symnds.com/software/Apache/incubator/spark/spark-0.9.0-incubating/spark-0.9.0-incubating-bin-cdh4.tgz
and indeed I see the same error that you see.
It looks like in previous versions sbt-launch used to just come down
in the
I realize that I never read the document carefully, and I can't find anywhere
that the Spark documentation suggests using the Spark-distributed sbt……
Best,
--
Nan Zhu
On Monday, March 24, 2014 at 5:47 PM, Diana Carroll wrote:
Thanks for your help, everyone. Several folks have explained that I can
We need someone who can walk us through the given example with a short code
snippet, so that we get a clear-cut idea of RDD indexing.
Guys, please help us.
Hi,
sc.parallelize(Array.tabulate(100)(i => i)).filter( _ % 20 == 0
).coalesce(5,true).glom.collect yields
Array[Array[Int]] = Array(Array(0, 20, 40, 60, 80), Array(), Array(),
Array(), Array())
How do I get something more like:
Array(Array(0), Array(20), Array(40), Array(60), Array(80))
There is also https://github.com/apache/spark/pull/18 against the current
repo which may be easier to apply.
On Fri, Mar 21, 2014 at 8:58 AM, Hai-Anh Trinh a...@adatao.com wrote:
Hi Jaonary,
You can find the code for k-fold CV in
https://github.com/apache/incubator-spark/pull/448. I have
It is suggested implicitly in giving you the command ./sbt/sbt. The
separately installed sbt isn't in a folder called sbt, whereas Spark's
version is. And more relevantly, just a few paragraphs earlier in the
tutorial you execute the command sbt/sbt assembly which definitely refers
to the spark
Partition your input into an even number of partitions,
then use mapPartitions to operate on each Iterator[Int].
Maybe there are more efficient ways….
Best,
--
Nan Zhu
On Monday, March 24, 2014 at 7:59 PM, yh18190 wrote:
Hi, I have large data set of numbers ie RDD and wanted to perform a
Yes, actually even for Spark, I mostly use the sbt I installed…..so I always
miss this issue….
If you can reproduce the problem with the Spark-distributed sbt…I suggest
proposing a PR to fix the document before 0.9.1 is officially released.
Best,
--
Nan Zhu
On Monday, March 24, 2014
Ognen, can you comment if you were actually able to run two jobs
concurrently with just restricting spark.cores.max? I run Shark on the
same cluster and was not able to see a standalone job get in (since
Shark is a long running job) until I restricted both spark.cores.max
_and_
I didn't group the integers, but processed them in groups of two within each
partition:
scala> val a = sc.parallelize(List(1, 2, 3, 4), 2)
a: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at
console:12
Then process each partition, handling the elements of the partition in groups
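Continuing that approach, a minimal sketch of the per-partition grouping (it
assumes each partition holds an even number of elements, and sum stands in for
whatever pairwise computation you need):
val a = sc.parallelize(List(1, 2, 3, 4, 5, 6), 3) // 3 partitions of 2 elements each
val pairResults = a.mapPartitions(_.grouped(2).map(_.sum))
pairResults.collect() // Array(3, 7, 11)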
points.foreach(p => p.y = another_value) will return a new modified RDD.
2014-03-24 18:13 GMT+08:00 Chieh-Yen r01944...@csie.ntu.edu.tw:
Dear all,
I have a question about the usage of RDD.
I implemented a class called AppDataPoint, it looks like:
case class AppDataPoint(input_y : Double,
Hi hequn, a related question: does that mean the memory usage will be doubled? And
furthermore, if the compute function in an RDD is not idempotent, the RDD will
change while the job is running, is that right?
-----Original Message-----
From: hequn cheng chenghe...@gmail.com
Sent: 2014/3/25 9:35
To:
No, it won't. The type of RDD#foreach is Unit, so it doesn't return an
RDD. The utility of foreach is purely for the side effects it generates,
not for its return value -- and modifying an RDD in place via foreach is
generally not a very good idea.
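A small sketch of the functional alternative (Point and the values are made up
for illustration; assumes sc is an existing SparkContext):
case class Point(x: Double, y: Double)
val points = sc.parallelize(Seq(Point(1.0, 0.0), Point(2.0, 0.0)))
// map returns a new RDD with the updated field; the original RDD is untouched
val updated = points.map(p => p.copy(y = 42.0))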
On Mon, Mar 24, 2014 at 6:35 PM, hequn cheng
Sorry, I meant the master branch of https://github.com/apache/spark. -Xiangrui
On Mon, Mar 24, 2014 at 6:27 PM, Tsai Li Ming mailingl...@ltsai.com wrote:
Thanks again.
If you use the KMeans implementation from MLlib, the
initialization stage is done on master,
The master here is the
First question:
If you save your modified RDD like this:
points.foreach(p => p.y = another_value).collect() or
points.foreach(p => p.y = another_value).saveAsTextFile(...)
the modified RDD will be materialized and this will not use any worker's
memory.
If you have more transformations after the map(), the
Hi hequn, I dug into the source of Spark a bit deeper, and I got some ideas.
Firstly, immutability is a feature of RDDs but not a hard rule; there are ways to
change it, for example an RDD with a non-idempotent compute function, though
it is really a bad design to make that function non-idempotent
My thought would be to key by the first item in each array, then take just
one array for each key. Something like the below:
val v = sc.parallelize(Seq(Seq(1,2,3,4), Seq(1,5,2,3), Seq(2,3,4,5)))
val col = 0
val output = v.keyBy(_(col)).reduceByKey((a, b) => a).values
On Tue, Mar 25, 2014 at 1:21 AM, Chengi Liu
Reopening this thread because I encountered this problem again.
Here is my env:
scala 2.10.3
spark 0.9.0, standalone mode
shark 0.9.0, downloaded the source code and built it myself
hive hive-shark-0.11
I have copied hive-site.xml from my hadoop cluster; its hive version is
0.12. After copying, I
This worked great. Thanks a lot
Is there a way to see the resource usage of each spark-shell command — say time
taken and memory used?
I checked the WebUI of spark-shell and of the master and I don’t see any such
breakdown. I see the time taken in the INFO logs but nothing about memory
usage. It would also be nice to track
Hi Qingyang Li,
Shark-0.9.0 uses a patched version of hive-0.11 and using
configuration/metastore of hive-0.12 could be incompatible.
May I know the reason you are using hive-site.xml from the previous hive version (to
use an existing metastore?). Otherwise, you might just leave hive-site.xml blank.
Possibly one of your executors is in the middle of a large stop-the-world
GC and doesn't respond to network traffic during that period? If you
shared some information about how each node in your cluster is set up (heap
size, memory, CPU, etc) that might help with debugging.
Andrew
On Mon, Mar
Hi,
I have been following Aniket's spork github repository.
https://github.com/aniket486/pig
I have done all the changes mentioned in recently modified pig-spark file.
I am using:
hadoop 2.0.5 alpha
spark-0.8.1-incubating
mesos 0.16.0
##PIG variables
export
Hi Guys,
I think that I already asked this question, but I don't remember if anyone
answered me. I would like to change, in the print() function, the
number of words and the frequency counts that are sent to the driver's
screen. The default value is 10.
Could anyone help me with this?
Best
You can extend DStream and override the print() method. Then you can create
your own DStream extending from this.
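If subclassing feels heavy, another option is foreachRDD, where you control the
count yourself (a sketch; wordCounts stands for whatever DStream you currently
call print() on):
wordCounts.foreachRDD { rdd =>
  rdd.take(100).foreach(println) // print the first 100 elements instead of the default 10
}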
On Tue, Mar 25, 2014 at 6:07 PM, Eduardo Costa Alfaia
e.costaalf...@unibs.it wrote:
Hi Guys,
I think that I already did this question, but I don't remember if anyone
has answered
Hi,
I have an example where I use a tuple of (int, int) in Python as the key for an
RDD. When I do a reduceByKey(...), sometimes the tuples turn up with the
two ints reversed in order (which is problematic, as the ordering is part
of the key).
Here is an IPython notebook that has some code and