In 1.0+ you can just pass the --executor-memory flag to ./bin/spark-shell.
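For readers of the archive, a sketch of both routes mentioned in this thread (the 2g value is just an example; SPARK_JAVA_OPTS was the pre-1.0 mechanism for passing system properties):

```shell
# Spark 1.0+: set executor memory with a flag
./bin/spark-shell --executor-memory 2g

# Pre-1.0: via a Java system property, as noted below in the thread
SPARK_JAVA_OPTS="-Dspark.executor.memory=2g" ./bin/spark-shell
```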
On Fri, Jun 6, 2014 at 12:32 AM, Oleg Proudnikov
wrote:
> Thank you, Hassan!
>
>
> On 6 June 2014 03:23, hassan wrote:
>>
>> just use -Dspark.executor.memory=
>>
>>
>>
ke it work. I think it's being tracked by this JIRA:
https://issues.apache.org/jira/browse/HIVE-5733
- Patrick
On Fri, Jun 6, 2014 at 12:08 PM, Silvio Fiorito
wrote:
> Is there a repo somewhere with the code for the Hive dependencies
> (hive-exec, hive-serde, & hive-metastore) u
the jar
because they go beyond the extended zip boundary `jar tvf` won't list
them.
- Patrick
On Sun, Jun 8, 2014 at 12:45 PM, Paul Brown wrote:
> Moving over to the dev list, as this isn't a user-scope issue.
>
> I just ran into this issue with the missing saveAsTextFile, an
Also I should add - thanks for taking time to help narrow this down!
On Sun, Jun 8, 2014 at 1:02 PM, Patrick Wendell wrote:
> Paul,
>
> Could you give the version of Java that you are building with and the
> version of Java you are running with? Are they the same?
>
> Just off
Okay I think I've isolated this a bit more. Let's discuss over on the JIRA:
https://issues.apache.org/jira/browse/SPARK-2075
On Sun, Jun 8, 2014 at 1:16 PM, Paul Brown wrote:
>
> Hi, Patrick --
>
> Java 7 on the development machines:
>
> » java -version
>
If you run locally then Spark doesn't launch remote executors. However,
in this case you can set the memory with the --driver-memory flag to
spark-submit. Does that work?
- Patrick
On Mon, Jun 9, 2014 at 3:24 PM, Henggang Cui wrote:
> Hi,
>
> I'm trying to run the Simple
Hey Jeremy,
This is patched in the 1.0 and 0.9 branches of Spark. We're likely to
make a 1.0.1 release soon (this patch being one of the main reasons),
but if you are itching for this sooner, you can just checkout the head
of branch-1.0 and you will be able to use r3.XXX instances.
- Patrick
By the way, in case it's not clear, I mean our maintenance branches:
https://github.com/apache/spark/tree/branch-1.0
On Tue, Jun 17, 2014 at 8:35 PM, Patrick Wendell wrote:
> Hey Jeremy,
>
> This is patched in the 1.0 and 0.9 branches of Spark. We're likely to
> make a 1.
which will be present in the 1.0 branch of Spark.
- Patrick
On Tue, Jun 17, 2014 at 9:29 PM, Jeremy Lee
wrote:
> I am about to spin up some new clusters, so I may give that a go... any
> special instructions for making them work? I assume I use the "
> --spark-git-repo=" option
These paths get passed directly to the Hadoop FileSystem API and I
think they support globbing out of the box. So AFAIK it should just
work.
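A minimal sketch of what that looks like (the path is hypothetical, and `sc` is assumed to be an existing SparkContext):

```scala
// The glob is expanded by the Hadoop FileSystem API before reading,
// so the wildcard matches every file under the pattern.
val logs = sc.textFile("hdfs:///logs/2014-06-*/part-*")
println(logs.count())
```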
On Tue, Jun 17, 2014 at 9:09 PM, MEETHU MATHEW wrote:
> Hi Jianshi,
>
> I have used wild card characters (*) in my program and it worked..
> My code was like
Out of curiosity - are you guys using speculation, shuffle
consolidation, or any other non-default option? If so that would help
narrow down what's causing this corruption.
On Tue, Jun 17, 2014 at 10:40 AM, Surendranauth Hiraman
wrote:
> Matt/Ryan,
>
> Did you make any headway on this? My team is
I'll make a comment on the JIRA - thanks for reporting this, let's get
to the bottom of it.
On Thu, Jun 19, 2014 at 11:19 AM, Surendranauth Hiraman
wrote:
> I've created an issue for this but if anyone has any advice, please let me
> know.
>
> Basically, on about 10 GBs of data, saveAsTextFile()
Hey There,
I'd like to start voting on this release shortly because there are a
few important fixes that have queued up. We're just waiting to fix an
Akka issue. I'd guess we'll cut a vote in the next few days.
- Patrick
On Thu, Jun 19, 2014 at 10:47 AM, Mingyu Kim wro
Hi There,
There is an issue with PySpark-on-YARN that requires users build with
Java 6. The issue has to do with how Java 6 and 7 package jar files
differently.
Can you try building spark with Java 6 and trying again?
- Patrick
On Fri, Jun 27, 2014 at 5:00 PM, sdeb wrote:
> Hello,
>
>
There isn't currently a way to do this, but it will start dropping
older applications once more than 200 are stored.
On Wed, Jul 9, 2014 at 4:04 PM, Haopu Wang wrote:
> Besides restarting the Master, is there any other way to clear the
> Completed Applications in Master web UI?
It fulfills a few different functions. The main one is giving users a
way to inject Spark as a runtime dependency separately from their
program and make sure they get exactly the right version of Spark. So
a user can bundle an application and then use spark-submit to send it
to different types of c
Hey Mikhail,
I think (hope?) the -em and -dm options were never in an official
Spark release. They were just in the master branch at some point. Did
you use these during a previous Spark release or were you just on
master?
- Patrick
On Wed, Jul 9, 2014 at 9:18 AM, Mikhail Strebkov wrote
I am happy to announce the availability of Spark 1.0.1! This release
includes contributions from 70 developers. Spark 1.0.1 includes fixes
across several areas of Spark, including the core API, PySpark, and
MLlib. It also includes new features in Spark's (alpha) SQL library,
including support for J
> -Brad
>
> On Fri, Jul 11, 2014 at 8:44 PM, Henry Saputra
> wrote:
>> Congrats to the Spark community !
>>
>> On Friday, July 11, 2014, Patrick Wendell wrote:
>>>
>>> I am happy to announce the availability of Spark 1.0.1! This release
>>
Adding new build modules is pretty high overhead, so if this is a case
where a small amount of duplicated code could get rid of the
dependency, that could also be a good short-term option.
- Patrick
On Mon, Jul 14, 2014 at 2:15 PM, Matei Zaharia wrote:
> Yeah, I'd just add a spark-util
All of the scripts we use to publish Spark releases are in the Spark
repo itself, so you could follow these as a guideline. The publishing
process in Maven is similar to in SBT:
https://github.com/apache/spark/blob/master/dev/create-release/create-release.sh#L65
On Mon, Jul 28, 2014 at 12:39 PM,
Hi,
We would like to use Spark SQL to store data in Parquet format and then
query that data using Impala.
We've tried to come up with a solution and it is working but it doesn't
seem good. So I was wondering if you guys could tell us what is the
correct way to do this. We are using Spark 1.0 an
How should we insert data from SparkSQL into a Parquet table which can be
directly queried by Impala?
Best regards,
Patrick
On 1 August 2014 16:18, Patrick McGloin wrote:
> Hi,
>
> We would like to use Spark SQL to store data in Parquet format and then
> query that data using Impa
This is a Scala bug - I filed something upstream, hopefully they can fix it
soon and/or we can provide a work around:
https://issues.scala-lang.org/browse/SI-8772
- Patrick
On Fri, Aug 1, 2014 at 3:15 PM, Holden Karau wrote:
> Currently scala 2.10.2 can't be pulled in from maven ce
I've had intermittent access to the artifacts themselves, but for me the
directory listing always 404s.
I think if sbt hits a 404 on the directory, it sends a somewhat confusing
error message that it can't download the artifact.
- Patrick
On Fri, Aug 1, 2014 at 3:28 PM, Shivar
're unsure of the best
practice for loading data into Parquet tables. Is the way we are doing the
Spark part correct in your opinion?
Best regards,
Patrick
On 1 August 2014 19:32, Michael Armbrust wrote:
> So is the only issue that impala does not see changes until you refresh
If you want to customize the logging behavior - the simplest way is to copy
conf/log4j.properties.template to conf/log4j.properties. Then you can go and
modify the log level in there. The Spark shells should pick this up.
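For reference, after copying the template, the relevant lines in conf/log4j.properties look like this (WARN and the Jetty override are only examples):

```properties
# conf/log4j.properties (copied from log4j.properties.template)
log4j.rootCategory=WARN, console
# Quiet a particularly chatty package further
log4j.logger.org.eclipse.jetty=ERROR
```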
On Sun, Aug 3, 2014 at 6:16 AM, Sean Owen wrote:
> That's just a templat
Blog: https://www.dbtsai.com
> LinkedIn: https://www.linkedin.com/in/dbtsai
>
>
> On Fri, Aug 1, 2014 at 3:31 PM, Patrick Wendell
> wrote:
> > I've had intermittent access to the artifacts themselves, but for me the
> > directory listing always 404s.
> >
>
I'll let TD chime in on this one, but I'm guessing this would be a welcome
addition. It's great to see community effort on adding new
streams/receivers, adding a Java API for receivers was something we did
specifically to allow this :)
- Patrick
On Sat, Aug 2, 2014 at 10:
thub.com/apache/spark/pull/1165
A (potential) workaround would be to first persist your data to disk, then
re-partition it, then cache it. I'm not 100% sure whether that will work
though.
val a = sc.textFile("s3n://some-path/*.json").persist(DISK_ONLY).repartition(larger nr of parti
BTW - the reason why the workaround could help is because when persisting
to DISK_ONLY, we explicitly avoid materializing the RDD partition in
memory... we just pass it through to disk
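Spelled out with a placeholder partition count (the path and the count are illustrative; `sc` is an existing SparkContext):

```scala
import org.apache.spark.storage.StorageLevel

// Persist to disk first so partitions are never fully materialized
// in memory, then repartition, then cache the result.
val a = sc.textFile("s3n://some-path/*.json")
  .persist(StorageLevel.DISK_ONLY)
val b = a.repartition(1000) // placeholder: pick a larger number of partitions
b.cache()
```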
On Mon, Aug 4, 2014 at 1:10 AM, Patrick Wendell wrote:
> It seems possible that you are running out of mem
For hortonworks, I believe it should work to just link against the
corresponding upstream version. I.e. just set the Hadoop version to "2.4.0"
Does that work?
- Patrick
On Mon, Aug 4, 2014 at 12:13 AM, Ron's Yahoo!
wrote:
> Hi,
> Not sure whose issue this is, but if I
Are you directly caching files from Hadoop or are you doing some
transformation on them first? If you are doing a groupBy or some type of
transformation, then you could be causing data skew that way.
On Sun, Aug 3, 2014 at 1:19 PM, iramaraju wrote:
> I am running spark 1.0.0, Tachyon 0.5 and Ha
You are hitting this issue:
https://issues.apache.org/jira/browse/SPARK-2075
On Mon, Jul 28, 2014 at 5:40 AM, lmk
wrote:
> Hi
> I was using saveAsTextFile earlier. It was working fine. When we migrated
> to
> spark-1.0, I started getting the following error:
> java.lang.ClassNotFoundException:
4 -Dhadoop.version=2.4.0.2.1.1.0-385
> -DskipTests clean package
>
> I haven't tried building a distro, but it should be similar.
>
>
> - SteveN
>
> On 8/4/14, 1:25, "Sean Owen" wrote:
>
> For any Hadoop 2.4 distro, yes, set hadoop.version but also set
> -Phadoop
>
> Thanks,
> Ron
>
> On Aug 4, 2014, at 10:01 AM, Ron's Yahoo! wrote:
>
> That failed since it defaulted the versions for yarn and hadoop
> I'll give it a try with just 2.4.0 for both yarn and hadoop...
>
> Thanks,
> Ron
>
> On Aug 4, 2014, at 9:44
ingle task.
In the latest version of Spark we've added documentation to make this
distinction more clear to users:
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala#L390
- Patrick
On Tue, Aug 5, 2014 at 6:13 AM, Jens Kristian Geyti wro
ay each
group out sequentially on disk on one big file, you can call `sortByKey`
with a hashed suffix as well. The sort functions are externalized in Spark
1.1 (which is in pre-release).
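A rough sketch of the hashed-suffix idea (assumes an existing pair RDD named `rdd` with String keys; the 16 sub-buckets are an arbitrary choice):

```scala
// rdd: RDD[(String, Array[Byte])] with some very large groups.
// Appending a hash-derived suffix spreads one huge key across
// several sort partitions.
val spread = rdd.map { case (k, v) =>
  ((k, v.hashCode % 16), v) // composite key: (original key, sub-bucket)
}
// Sorting by the composite key still lays each group out
// contiguously, just split across sub-buckets.
val sorted = spread.sortByKey()
```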
- Patrick
On Tue, Aug 5, 2014 at 2:39 PM, Jens Kristian Geyti wrote:
> Patrick Wendell wrote
> > In
collection of
types I had.
Best regards,
Patrick
On 6 August 2014 07:58, Amit Kumar wrote:
> Hi All,
>
> I am having some trouble trying to write generic code that uses sqlContext
> and RDDs. Can you suggest what might be wrong?
>
> class SparkTable[T : ClassTag](val sqlConte
Your rdd2 and rdd3 differ in two ways so it's hard to track the exact
effect of caching. In rdd3, in addition to the fact that rdd will be
cached, you are also doing a bunch of extra random number generation. So it
will be hard to isolate the effect of caching.
On Wed, Aug 20, 2014 at 7:48 AM, Gr
For large objects, it will be more efficient to broadcast it. If your array
is small it won't really matter. How many centers do you have? Unless you
are finding that you have very large tasks (and Spark will print a warning
about this), it could be okay to just reference it directly.
On Wed, Aug
The reason is that some operators get pipelined into a single stage.
rdd.map(XX).filter(YY) - this executes in a single stage since there is no
data movement needed in between these operations.
If you call toDebugString on the final RDD it will give you some
information about the exact lineage. In
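Illustrating the pipelining point (`rdd` is any existing RDD of numbers):

```scala
// map and filter need no data movement, so they are pipelined
// into a single stage.
val result = rdd.map(_ * 2).filter(_ > 10)

// Prints the lineage; shuffle operations such as reduceByKey
// would start a new stage in this output.
println(result.toDebugString)
```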
Yep - that's correct. As an optimization we save the shuffle output and
re-use it if you execute a stage twice. So this can make A/B tests like
this a bit confusing.
- Patrick
On Friday, August 22, 2014, Nieyuan wrote:
> Because map-reduce tasks like join will save shuffle data to d
Hey Andrew,
We might create a new JIRA for it, but it doesn't exist yet. We'll create
JIRAs for the major 1.2 issues at the beginning of September.
- Patrick
On Mon, Aug 25, 2014 at 8:53 AM, Andrew Ash wrote:
> Hi Patrick,
>
> For the spilling within on key work y
any new entries here:
https://cwiki.apache.org/confluence/display/SPARK/Powered+By+Spark
- Patrick
Yeah - each batch will produce a new RDD.
On Wed, Aug 27, 2014 at 3:33 PM, Soumitra Kumar
wrote:
> Thanks.
>
> Just to double check, rdd.id would be unique for a batch in a DStream?
>
>
> On Wed, Aug 27, 2014 at 3:04 PM, Xiangrui Meng wrote:
>>
>> You can use RDD id as the seed, which is unique
Changing this is not supported; it is immutable, similar to other Spark
configuration settings.
On Wed, Sep 3, 2014 at 8:13 PM, 牛兆捷 wrote:
> Dear all:
>
> Spark uses memory to cache RDD and the memory size is specified by
> "spark.storage.memoryFraction".
>
> Once the Executor starts, does Spark su
I would say that the first three are all used pretty heavily. Mesos
was the first one supported (long ago), the standalone is the
simplest and most popular today, and YARN is newer but growing a lot
in activity.
SIMR is not used as much... it was designed mostly for environments
where users had a
g.
Thanks, and congratulations!
- Patrick
[moving to user@]
This would typically be accomplished with a union() operation. You
can't mutate an RDD in-place, but you can create a new RDD with a
union() which is an inexpensive operator.
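A sketch of the union() approach (both RDDs are hypothetical):

```scala
val existing = sc.parallelize(Seq(1, 2, 3))
val incoming = sc.parallelize(Seq(4, 5))

// RDDs are immutable; union() builds a new RDD over both parent
// lineages without copying or moving data, so it is cheap.
val combined = existing.union(incoming)
```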
On Fri, Sep 12, 2014 at 5:28 AM, Archit Thakur
wrote:
> Hi,
>
> We have a use case where we are plannin
Hey SK,
Yeah, the documented format is the same (we expect users to add the
jar at the end) but the old spark-submit had a bug where it would
actually accept inputs that did not match the documented format. Sorry
if this was difficult to find!
- Patrick
On Fri, Sep 12, 2014 at 1:50 PM, SK
Yeah that issue has been fixed by adding better docs, it just didn't make
it in time for the release:
https://github.com/apache/spark/blob/branch-1.1/make-distribution.sh#L54
On Thu, Sep 11, 2014 at 11:57 PM, Zhanfeng Huo
wrote:
> resolved:
>
> ./make-distribution.sh --name spark-hadoop-2.3.0
If each partition can fit in memory, you can do this using
mapPartitions and then building an inverse mapping within each
partition. You'd need to construct a hash map within each partition
yourself.
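One way to sketch that (the element types are illustrative):

```scala
// rdd: RDD[(String, Long)] mapping name -> id. Build the inverse
// (id -> name) within each partition, assuming each partition's
// map fits in memory.
val inverted = rdd.mapPartitions { iter =>
  val m = scala.collection.mutable.HashMap.empty[Long, String]
  iter.foreach { case (name, id) => m(id) = name }
  m.iterator
}
```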
On Tue, Sep 16, 2014 at 4:27 PM, Akshat Aranya wrote:
> I have a use case where my RDD is set up
wrote:
> Patrick,
>
> If I understand this correctly, I won't be able to do this in the closure
> provided to mapPartitions() because that's going to be stateless, in the
> sense that a hash map that I create within the closure would only be useful
> for one call of MapPartitio
I agree, that's a good idea Marcelo. There isn't AFAIK any reason the
client needs to hang there for correct operation.
On Thu, Sep 18, 2014 at 9:39 AM, Marcelo Vanzin wrote:
> Yes, what Sandy said.
>
> On top of that, I would suggest filing a bug for a new command line
> argument for spark-submi
Hey Grzegorz,
EMR is a service that is not maintained by the Spark community. So
this list isn't the right place to ask EMR questions.
- Patrick
On Thu, Sep 18, 2014 at 3:19 AM, Grzegorz Białek
wrote:
> Hi,
> I would like to run Spark application on Amazon EMR. I have some questi
ar) but the Executor
doesn't find the class. Here is the command:
sudo ./spark-submit --class aac.main.SparkDriver --master
spark://localhost:7077 --jars AAC-assembly-1.0.jar aacApp_2.10-1.0.jar
Any pointers would be appreciated!
Best regards,
Patrick
FYI, in case anybody else has this problem, we switched to Spark 1.1
(outside CDH) and the same Spark application worked first time (once
recompiled with Spark 1.1 libs of course). I assume this is because Spark
1.1 is compiled with Hive.
On 29 September 2014 17:41, Patrick McGloin
wrote:
>
IIRC - the random is seeded with the index, so it will always produce
the same result for the same index. Maybe I don't totally follow
though. Could you give a small example of how this might change the
RDD ordering in a way that you don't expect? In general repartition()
will not preserve the orde
Spark will need to connect both to the hive metastore and to all HDFS
nodes (NN and DN's). If that is all in place then it should work. In
this case it looks like maybe it can't connect to a datanode in HDFS
to get the raw data. Keep in mind that the performance might not be
very good if you are tr
maven it's more clunky but if you do a "mvn install" first then (I
think) you can test sub-modules independently:
mvn test -pl streaming ...
- Patrick
On Wed, Oct 22, 2014 at 10:00 PM, Ryan Williams
wrote:
> I started building Spark / running Spark tests this weekend and on
It shows the amount of memory used to store RDD blocks, which are created
when you run .cache()/.persist() on an RDD.
On Wed, Oct 22, 2014 at 10:07 PM, Haopu Wang wrote:
> Hi, please take a look at the attached screen-shot. I wonders what's the
> "Memory Used" column mean.
>
>
>
> I give 2GB me
orks. When deployed to the Spark Cluster the following
error is logged by the worker who tries to use Akka Camel:
-- Forwarded message --
From: Patrick McGloin
Date: 24 October 2014 15:09
Subject: Re: [akka-user] Akka Camel plus Spark Streaming
To: akka-u...@googlegroups.com
Hi
is in the assembled jar file. Please see the mails below,
which I sent to the Akka group for details.
Is there something I am doing wrong? Is there a way to get the Akka
Cluster to load the reference.conf from Camel?
Any help greatly appreciated!
Best regards,
Patrick
On 27 October 2014 11:3
://issues.apache.org/jira/browse/SPARK-4114
This is a very important issue for Spark SQL, so I'd welcome comments
on that JIRA from anyone who is familiar with Hive/HCatalog internals.
- Patrick
On Mon, Oct 27, 2014 at 9:54 PM, Cheng, Hao wrote:
> Hi, all
>
>I have some PRs
n one or two cases we've exposed functions that rely
on this:
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala#L334
I would expect more robust support for online aggregation to show up
in a future version of Spark.
- Patrick
On T
The doc build appears to be broken in master. We'll get it patched up
before the release:
https://issues.apache.org/jira/browse/SPARK-4326
On Tue, Nov 11, 2014 at 10:50 AM, Alessandro Baretta
wrote:
> Nichols and Patrick,
>
> Thanks for your help, but, no, it still does not wo
Hi There,
Because Akka versions are not binary compatible with one another, it
might not be possible to integrate Play with Spark 1.1.0.
- Patrick
On Tue, Nov 11, 2014 at 8:21 AM, Akshat Aranya wrote:
> Hi,
>
> Sorry if this has been asked before; I didn't find a satisfactory
It looks like you are trying to directly import the toLocalIterator
function. You can't import functions, it should just appear as a
method of an existing RDD if you have one.
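In other words, it is called as a method rather than imported (the RDD here is hypothetical):

```scala
val rdd = sc.parallelize(1 to 100, 4)

// toLocalIterator is a method on the RDD, not an importable function.
// It streams one partition at a time to the driver, unlike collect(),
// which pulls everything at once.
rdd.toLocalIterator.foreach(println)
```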
- Patrick
On Thu, Nov 13, 2014 at 10:21 PM, Deep Pradhan
wrote:
> Hi,
>
> I am using Spark 1.0.0 an
Dear all,
Currently, I am running spark standalone cluster with ~100 nodes.
Multiple users can connect to the cluster by Spark-shell or PyShell.
However, I can't find an efficient way to control the resources among multiple
users.
I can set "spark.deploy.defaultCores" in the server side to lim
hould not do this.
- Patrick
On Wed, Nov 26, 2014 at 1:45 AM, Judy Nash
wrote:
> Looks like a config issue. I ran spark-pi job and still failing with the
> same guava error
>
> Command ran:
>
> .\bin\spark-class.cmd org.apache.spark.deploy.SparkSubmit --class
> org.apa
"org/spark-project/guava/common/base/Preconditions".checkArgument:(ZLjava/lang/Object;)V
50: invokestatic #502 // Method
"org/spark-project/guava/common/base/Preconditions".checkArgument:(ZLjava/lang/Object;)V
On Wed, Nov 26, 2014 at 11:08 AM, Patri
I recently posted instructions on loading Spark in Intellij from scratch:
https://cwiki.apache.org/confluence/display/SPARK/Useful+Developer+Tools#UsefulDeveloperTools-BuildingSparkinIntelliJIDEA
You need to do a few extra steps for the YARN project to work.
Also, for questions like this that re
asses present it can cause issues.
On Sun, Nov 30, 2014 at 10:53 PM, Judy Nash
wrote:
> Thanks Patrick and Cheng for the suggestions.
>
> The issue was Hadoop common jar was added to a classpath. After I removed
> Hadoop common jar from both master and slave, I was able to bypass the
Thanks for flagging this. I reverted the relevant YARN fix in Spark
1.2 release. We can try to debug this in master.
On Thu, Dec 4, 2014 at 9:51 PM, Jianshi Huang wrote:
> I created a ticket for this:
>
> https://issues.apache.org/jira/browse/SPARK-4757
>
>
> Jianshi
>
> On Fri, Dec 5, 2014 at
The second choice is better. Once you call collect() you are pulling
all of the data onto a single node, you want to do most of the
processing in parallel on the cluster, which is what map() will do.
Ideally you'd try to summarize the data or reduce it before calling
collect().
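Sketched with a hypothetical RDD of numbers:

```scala
val nums = sc.parallelize(1L to 1000000L)

// Reduce on the cluster first; only the small summary value
// reaches the driver.
val total = nums.reduce(_ + _)

// By contrast, nums.collect().sum would ship every element to
// a single node before doing any work.
```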
On Fri, Dec 5, 201
Yeah the main way to do this would be to have your own static cache of
connections. These could be using an object in Scala or just a static
variable in Java (for instance a set of connections that you can
borrow from).
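A rough shape for that static cache in Scala (`Connection` and `makeConnection()` are placeholders for whatever client library is in use):

```scala
// Placeholders for the real client library.
trait Connection { def send(s: String): Unit }
def makeConnection(): Connection = ??? // placeholder

object ConnectionPool {
  // One pool per JVM, i.e. per executor; grows lazily on demand.
  private val pool = new java.util.concurrent.ConcurrentLinkedQueue[Connection]()
  def borrow(): Connection = Option(pool.poll()).getOrElse(makeConnection())
  def giveBack(c: Connection): Unit = pool.offer(c)
}

// Borrow one connection per partition rather than per record.
rdd.foreachPartition { iter =>
  val conn = ConnectionPool.borrow()
  iter.foreach(record => conn.send(record))
  ConnectionPool.giveBack(conn)
}
```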
- Patrick
On Thu, Dec 4, 2014 at 5:26 PM, Tobias Pfeiffer wrote:
>
various types of execution services
for spark apps.
- Patrick
On Fri, Dec 12, 2014 at 10:06 AM, Manoj Samel wrote:
> Thanks Marcelo.
>
> Spark Gurus/Databricks team - do you have something in roadmap for such a
> spark server ?
>
> Thanks,
>
> On Thu, Dec 11, 2014 at 5:43 P
is
intended to produce a side effect and map for something that will
return a new dataset.
On Wed, Dec 17, 2014 at 5:43 AM, Gerard Maas wrote:
> Patrick,
>
> I was wondering why one would choose for rdd.map vs rdd.foreach to execute a
> side-effecting function on an RDD.
>
> -kr, Gera
I'm happy to announce the availability of Spark 1.2.0! Spark 1.2.0 is
the third release on the API-compatible 1.X line. It is Spark's
largest release ever, with contributions from 172 developers and more
than 1,000 commits!
This release brings operational and performance improvements in Spark
core
2.0 and v1.2.0-rc2 are pointed to different commits in
>> https://github.com/apache/spark/releases
>>
>> Best Regards,
>>
>> Shixiong Zhu
>>
>> 2014-12-19 16:52 GMT+08:00 Patrick Wendell :
>>>
>>> I'm happy to announce the availability of S
Xiangrui asked me to report that it's back and running :)
On Mon, Dec 22, 2014 at 3:21 PM, peng wrote:
> Me 2 :)
>
>
> On 12/22/2014 06:14 PM, Andrew Ash wrote:
>
> Hi Xiangrui,
>
> That link is currently returning a 503 Over Quota error message. Would you
> mind pinging back out when the page i
Hey Nick,
I think Hitesh was just trying to be helpful and point out the policy
- not necessarily saying there was an issue. We've taken a close look
at this and I think we're in good shape here vis-à-vis this policy.
- Patrick
On Mon, Dec 22, 2014 at 5:29 PM, Nicholas Chammas
wrote
Is it sufficient to set "spark.hadoop.validateOutputSpecs" to false?
http://spark.apache.org/docs/latest/configuration.html
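For reference, a sketch of setting it programmatically (disabling the check that fails when the output path already exists; default is true):

```scala
val conf = new org.apache.spark.SparkConf()
  .set("spark.hadoop.validateOutputSpecs", "false")
val sc = new org.apache.spark.SparkContext(conf)
// saveAsTextFile will no longer fail if the output path exists.
```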
- Patrick
On Wed, Dec 24, 2014 at 10:52 PM, Shao, Saisai wrote:
> Hi,
>
>
>
> We have such requirements to save RDD output to HDFS with saveA
ble as any alternatives. This is already pretty easy IMO.
- Patrick
On Wed, Dec 24, 2014 at 11:28 PM, Cheng, Hao wrote:
> I am wondering if we can provide more friendly API, other than configuration
> for this purpose. What do you think Patrick?
>
> Cheng Hao
>
> -Original
longer be referenced. If you are
seeing a large build up of shuffle data, it's possible you are
retaining references to older RDDs inadvertently. Could you explain
what your job is actually doing?
- Patrick
On Mon, Dec 22, 2014 at 2:36 PM, Ganelin, Ilya
wrote:
> Hi all, I have a long running jo
Hey Eric,
I'm just curious - which specific features in 1.2 do you find most
help with usability? This is a theme we're focusing on for 1.3 as
well, so it's helpful to hear what makes a difference.
- Patrick
On Sun, Dec 28, 2014 at 1:36 AM, Eric Friedman
wrote:
> Hi Josh,
It should appear in the page for any stage in which accumulators are updated.
On Wed, Jan 14, 2015 at 6:46 PM, Justin Yip wrote:
> Hello,
>
> From accumulator documentation, it says that if the accumulator is named, it
> will be displayed in the WebUI. However, I cannot find it anywhere.
>
> Do I
Akhil,
Those are handled by ASF infrastructure, not anyone in the Spark
project. So this list is not the appropriate place to ask for help.
- Patrick
On Sat, Jan 17, 2015 at 12:56 AM, Akhil Das wrote:
> My mails to the mailing list are getting rejected, have opened a Jira issue,
> can s
Yep, currently it only supports running at least 1 slave.
On Sat, Mar 1, 2014 at 4:47 PM, nicholas.chammas
wrote:
> I successfully launched a Spark EC2 "cluster" with 0 slaves using spark-ec2.
> When trying to login to the master node with spark-ec2 login, I get the
> following:
>
> Searching for
Spark with this batch and seeing if
it works that would be great.
Thanks,
Patrick
On Wed, Mar 5, 2014 at 10:26 AM, Paul Brown wrote:
>
> Hi, Sergey --
>
> Here's my recipe, implemented via Maven; YMMV if you need to do it via sbt,
> etc., but it should
ssic/1.1.1
- Patrick
On Wed, Mar 5, 2014 at 1:52 PM, Sergey Parhomenko wrote:
> Hi Patrick,
>
> Thanks for the patch. I tried building a patched version of
> spark-core_2.10-0.9.0-incubating.jar but the Maven build fails:
> [ERROR]
> /home/das/Work/thx/incubator-spark/core/src
The difference between your two jobs is that take() is optimized and
only runs on the machine where you are using the shell, whereas
sortByKey requires using many machines. It seems like maybe python
didn't get upgraded correctly on one of the slaves. I would look in
the /root/spark/work/ folder (f
Hey There,
This is interesting... thanks for sharing this. If you are storing in
MEMORY_ONLY then you are just directly storing Java objects in the
JVM. So they can't be compressed because they aren't really stored in
a known format; it's just left up to the JVM.
To answer your other question, it's
hines. If you see stderr but not stdout
that's a bit of a puzzler since they both go through the same
mechanism.
- Patrick
On Sun, Mar 9, 2014 at 2:32 PM, Sen, Ranjan [USA] wrote:
> Hi
> I have some System.out.println in my Java code that is working ok in a local
> environment. But
Hey Sen,
Suarav is right, and I think all of your print statements are inside of the
driver program rather than inside of a closure. How are you running your
program (i.e. what do you run that starts this job)? Where you run the
driver you should expect to see the output.
- Patrick
On Mon, Mar
't change so it won't help the ulimit problem.
This means you'll have to use fewer reducers (e.g. pass reduceByKey a
number of reducers) or use fewer cores on each machine.
- Patrick
On Mon, Mar 10, 2014 at 10:41 AM, Matthew Cheah
wrote:
> Hi everyone,
>
> My team (cc'
A block is an internal construct that isn't directly exposed to users.
Internally though, each partition of an RDD is mapped to one block.
- Patrick
On Mon, Mar 10, 2014 at 11:06 PM, David Thomas wrote:
> What is the concept of Block and BlockManager in Spark? How is a Block
> r
Dianna I'm forwarding this to the dev list since it might be useful
there as well.
On Wed, Mar 12, 2014 at 11:39 AM, Diana Carroll wrote:
> Hi all. I needed to build the Spark docs. The basic instructions to do
> this are in spark/docs/README.md but it took me quite a bit of playing
> around to
is:
for slave in `cat "$HOSTLIST"|sed "s/#.*$//;/^$/d"`; do
to this
for slave in `cat "$HOSTLIST"| head -n $NUM_SLAVES | sed "s/#.*$//;/^$/d"`; do
Then you could just set NUM_SLAVES before you stop/start. Not sure if
this helps much but maybe it'