Re: Wildcard support in input path

2014-06-17 Thread Patrick Wendell
These paths get passed directly to the Hadoop FileSystem API and I think they support globbing out of the box. So AFAIK it should just work. On Tue, Jun 17, 2014 at 9:09 PM, MEETHU MATHEW meethu2...@yahoo.co.in wrote: Hi Jianshi, I have used wild card characters (*) in my program and it

Re: Java IO Stream Corrupted - Invalid Type AC?

2014-06-17 Thread Patrick Wendell
Out of curiosity - are you guys using speculation, shuffle consolidation, or any other non-default option? If so that would help narrow down what's causing this corruption. On Tue, Jun 17, 2014 at 10:40 AM, Surendranauth Hiraman suren.hira...@velos.io wrote: Matt/Ryan, Did you make any headway

Re: Trailing Tasks Saving to HDFS

2014-06-19 Thread Patrick Wendell
I'll make a comment on the JIRA - thanks for reporting this, let's get to the bottom of it. On Thu, Jun 19, 2014 at 11:19 AM, Surendranauth Hiraman suren.hira...@velos.io wrote: I've created an issue for this but if anyone has any advice, please let me know. Basically, on about 10 GBs of

Re: 1.0.1 release plan

2014-06-20 Thread Patrick Wendell
Hey There, I'd like to start voting on this release shortly because there are a few important fixes that have queued up. We're just waiting to fix an akka issue. I'd guess we'll cut a vote in the next few days. - Patrick On Thu, Jun 19, 2014 at 10:47 AM, Mingyu Kim m...@palantir.com wrote: Hi

Re: hadoop + yarn + spark

2014-06-27 Thread Patrick Wendell
Hi There, There is an issue with PySpark-on-YARN that requires users to build with Java 6. The issue has to do with how Java 6 and 7 package jar files differently. Can you try building Spark with Java 6 and trying again? - Patrick On Fri, Jun 27, 2014 at 5:00 PM, sdeb sangha...@gmail.com wrote

Re: How to clear the list of Completed Appliations in Spark web UI?

2014-07-09 Thread Patrick Wendell
There isn't currently a way to do this, but it will start dropping older applications once more than 200 are stored. On Wed, Jul 9, 2014 at 4:04 PM, Haopu Wang hw...@qilinsoft.com wrote: Besides restarting the Master, is there any other way to clear the Completed Applications in Master web UI?
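A hedged sketch of the relevant knob: in standalone mode, the Master's `spark.deploy.retainedApplications` setting controls how many completed applications the UI keeps before dropping the oldest (200 is the default referenced above); the value 50 below is only an example.

```
# Set on the standalone Master, e.g. in conf/spark-env.sh:
# retain only the 50 most recent completed applications in the web UI.
export SPARK_MASTER_OPTS="-Dspark.deploy.retainedApplications=50"
```

This shrinks the list sooner but still does not let you clear it on demand; a Master restart remains the only way to empty it outright.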

Re: Purpose of spark-submit?

2014-07-09 Thread Patrick Wendell
It fulfills a few different functions. The main one is giving users a way to inject Spark as a runtime dependency separately from their program and make sure they get exactly the right version of Spark. So a user can bundle an application and then use spark-submit to send it to different types of

Re: issues with ./bin/spark-shell for standalone mode

2014-07-09 Thread Patrick Wendell
Hey Mikhail, I think (hope?) the -em and -dm options were never in an official Spark release. They were just in the master branch at some point. Did you use these during a previous Spark release or were you just on master? - Patrick On Wed, Jul 9, 2014 at 9:18 AM, Mikhail Strebkov streb

Announcing Spark 1.0.1

2014-07-11 Thread Patrick Wendell
I am happy to announce the availability of Spark 1.0.1! This release includes contributions from 70 developers. Spark 1.0.1 includes fixes across several areas of Spark, including the core API, PySpark, and MLlib. It also includes new features in Spark's (alpha) SQL library, including support for

Re: Catalyst dependency on Spark Core

2014-07-14 Thread Patrick Wendell
Adding new build modules is pretty high overhead, so if this is a case where a small amount of duplicated code could get rid of the dependency, that could also be a good short-term option. - Patrick On Mon, Jul 14, 2014 at 2:15 PM, Matei Zaharia matei.zaha...@gmail.com wrote: Yeah, I'd just add

Re: how to publish spark inhouse?

2014-07-28 Thread Patrick Wendell
All of the scripts we use to publish Spark releases are in the Spark repo itself, so you could follow these as a guideline. The publishing process in Maven is similar to in SBT: https://github.com/apache/spark/blob/master/dev/create-release/create-release.sh#L65 On Mon, Jul 28, 2014 at 12:39 PM,

Spark SQL, Parquet and Impala

2014-08-01 Thread Patrick McGloin
Hi, We would like to use Spark SQL to store data in Parquet format and then query that data using Impala. We've tried to come up with a solution and it is working but it doesn't seem good. So I was wondering if you guys could tell us what is the correct way to do this. We are using Spark 1.0

Re: Spark SQL, Parquet and Impala

2014-08-01 Thread Patrick McGloin
insert data from SparkSQL into a Parquet table which can be directly queried by Impala? Best regards, Patrick On 1 August 2014 16:18, Patrick McGloin mcgloin.patr...@gmail.com wrote: Hi, We would like to use Spark SQL to store data in Parquet format and then query that data using Impala

Re: Compiling Spark master (284771ef) with sbt/sbt assembly fails on EC2

2014-08-01 Thread Patrick Wendell
This is a Scala bug - I filed something upstream, hopefully they can fix it soon and/or we can provide a work around: https://issues.scala-lang.org/browse/SI-8772 - Patrick On Fri, Aug 1, 2014 at 3:15 PM, Holden Karau hol...@pigscanfly.ca wrote: Currently scala 2.10.2 can't be pulled in from

Re: Compiling Spark master (284771ef) with sbt/sbt assembly fails on EC2

2014-08-01 Thread Patrick Wendell
I've had intermiddent access to the artifacts themselves, but for me the directory listing always 404's. I think if sbt hits a 404 on the directory, it sends a somewhat confusing error message that it can't download the artifact. - Patrick On Fri, Aug 1, 2014 at 3:28 PM, Shivaram Venkataraman

Re: Spark SQL, Parquet and Impala

2014-08-02 Thread Patrick McGloin
of the best practice for loading data into Parquet tables. Is the way we are doing the Spark part correct in your opinion? Best regards, Patrick On 1 August 2014 19:32, Michael Armbrust mich...@databricks.com wrote: So is the only issue that impala does not see changes until you refresh

Re: disable log4j for spark-shell

2014-08-03 Thread Patrick Wendell
If you want to customize the logging behavior - the simplest way is to copy conf/log4j.properties.template to conf/log4j.properties. Then you can go and modify the log level in there. The spark shells should pick this up. On Sun, Aug 3, 2014 at 6:16 AM, Sean Owen so...@cloudera.com wrote:
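A minimal example of the edit being described, assuming the stock template that ships with Spark: after copying the template, raising the root category level quiets the shell's INFO chatter.

```
# conf/log4j.properties (copied from conf/log4j.properties.template)
# Only warnings and errors reach the console; INFO lines are suppressed.
log4j.rootCategory=WARN, console
```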

Re: What should happen if we try to cache more data than the cluster can hold in memory?

2014-08-04 Thread Patrick Wendell
/spark/pull/1165 A (potential) workaround would be to first persist your data to disk, then re-partition it, then cache it. I'm not 100% sure whether that will work though. val a = sc.textFile("s3n://some-path/*.json").persist(DISK_ONLY).repartition(larger nr of partitions).cache() - Patrick On Fri

Re: What should happen if we try to cache more data than the cluster can hold in memory?

2014-08-04 Thread Patrick Wendell
BTW - the reason why the workaround could help is because when persisting to DISK_ONLY, we explicitly avoid materializing the RDD partition in memory... we just pass it through to disk On Mon, Aug 4, 2014 at 1:10 AM, Patrick Wendell pwend...@gmail.com wrote: It seems possible that you

Re: Issues with HDP 2.4.0.2.1.3.0-563

2014-08-04 Thread Patrick Wendell
For hortonworks, I believe it should work to just link against the corresponding upstream version. I.e. just set the Hadoop version to 2.4.0 Does that work? - Patrick On Mon, Aug 4, 2014 at 12:13 AM, Ron's Yahoo! zlgonza...@yahoo.com.invalid wrote: Hi, Not sure whose issue

Re: Cached RDD Block Size - Uneven Distribution

2014-08-04 Thread Patrick Wendell
Are you directly caching files from Hadoop or are you doing some transformation on them first? If you are doing a groupBy or some type of transformation, then you could be causing data skew that way. On Sun, Aug 3, 2014 at 1:19 PM, iramaraju iramar...@gmail.com wrote: I am running spark 1.0.0,

Re: Understanding RDD.GroupBy OutOfMemory Exceptions

2014-08-05 Thread Patrick Wendell
out sequentially on disk on one big file, you can call `sortByKey` with a hashed suffix as well. The sort functions are externalized in Spark 1.1 (which is in pre-release). - Patrick On Tue, Aug 5, 2014 at 2:39 PM, Jens Kristian Geyti sp...@jkg.dk wrote: Patrick Wendell wrote In the latest

Re: type issue: found RDD[T] expected RDD[A]

2014-08-19 Thread Patrick McGloin
for a collection of types I had. Best regards, Patrick On 6 August 2014 07:58, Amit Kumar kumarami...@gmail.com wrote: Hi All, I am having some trouble trying to write generic code that uses sqlContext and RDDs. Can you suggest what might be wrong? class SparkTable[T : ClassTag](val

Re: Advantage of using cache()

2014-08-20 Thread Patrick Wendell
Your rdd2 and rdd3 differ in two ways so it's hard to track the exact effect of caching. In rdd3, in addition to the fact that rdd will be cached, you are also doing a bunch of extra random number generation. So it will be hard to isolate the effect of caching. On Wed, Aug 20, 2014 at 7:48 AM,

Re: Broadcast vs simple variable

2014-08-20 Thread Patrick Wendell
For large objects, it will be more efficient to broadcast it. If your array is small it won't really matter. How many centers do you have? Unless you are finding that you have very large tasks (and Spark will print a warning about this), it could be okay to just reference it directly. On Wed,

Re: Web UI doesn't show some stages

2014-08-20 Thread Patrick Wendell
The reason is that some operators get pipelined into a single stage. rdd.map(XX).filter(YY) - this executes in a single stage since there is no data movement needed in between these operations. If you call toDebugString on the final RDD it will give you some information about the exact lineage.
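A Spark-free sketch of the pipelining idea, using Python generators: like a pipelined stage, each element flows through the map step and the filter step in one pass, with no intermediate collection between them. The `xx`/`yy` names echo the `XX`/`YY` placeholders above and are purely illustrative.

```python
data = range(10)
trace = []  # records the order in which the two steps run

def xx(x):
    trace.append(("map", x))
    return x * 2

def yy(x):
    trace.append(("filter", x))
    return x > 10

# A generator is lazy, so map and filter interleave per element,
# just as pipelined operators do within a single Spark stage.
mapped = (xx(x) for x in data)
result = [x for x in mapped if yy(x)]

print(result)     # [12, 14, 16, 18]
print(trace[:2])  # [('map', 0), ('filter', 0)] -- interleaved, not two passes
```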

Re: Advantage of using cache()

2014-08-23 Thread Patrick Wendell
Yep - that's correct. As an optimization we save the shuffle output and re-use it if you execute a stage twice. So this can make A/B tests like this a bit confusing. - Patrick On Friday, August 22, 2014, Nieyuan qiushuiwuh...@gmail.com wrote: Because map-reduce tasks like join will save

Re: Understanding RDD.GroupBy OutOfMemory Exceptions

2014-08-25 Thread Patrick Wendell
Hey Andrew, We might create a new JIRA for it, but it doesn't exist yet. We'll create JIRA's for the major 1.2 issues at the beginning of September. - Patrick On Mon, Aug 25, 2014 at 8:53 AM, Andrew Ash and...@andrewash.com wrote: Hi Patrick, For the spilling within on key work you mention

Submit to the Powered By Spark Page!

2014-08-26 Thread Patrick Wendell
: https://cwiki.apache.org/confluence/display/SPARK/Powered+By+Spark - Patrick

Re: Spark Streaming: DStream - zipWithIndex

2014-08-27 Thread Patrick Wendell
Yeah - each batch will produce a new RDD. On Wed, Aug 27, 2014 at 3:33 PM, Soumitra Kumar kumar.soumi...@gmail.com wrote: Thanks. Just to double check, rdd.id would be unique for a batch in a DStream? On Wed, Aug 27, 2014 at 3:04 PM, Xiangrui Meng men...@gmail.com wrote: You can use RDD

Re: memory size for caching RDD

2014-09-03 Thread Patrick Wendell
Changing this is not supported, it is immutable, similar to other Spark configuration settings. On Wed, Sep 3, 2014 at 8:13 PM, 牛兆捷 nzjem...@gmail.com wrote: Dear all: Spark uses memory to cache RDD and the memory size is specified by spark.storage.memoryFraction. Once the Executor starts,

Re: Deployment model popularity - Standard vs. YARN vs. Mesos vs. SIMR

2014-09-07 Thread Patrick Wendell
I would say that the first three are all used pretty heavily. Mesos was the first one supported (long ago), the standalone is the simplest and most popular today, and YARN is newer but growing a lot in activity. SIMR is not used as much... it was designed mostly for environments where users had

Announcing Spark 1.1.0!

2014-09-11 Thread Patrick Wendell
, and congratulations! - Patrick - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org

Re: Use Case of mutable RDD - any ideas around will help.

2014-09-12 Thread Patrick Wendell
[moving to user@] This would typically be accomplished with a union() operation. You can't mutate an RDD in-place, but you can create a new RDD with a union() which is an inexpensive operator. On Fri, Sep 12, 2014 at 5:28 AM, Archit Thakur archit279tha...@gmail.com wrote: Hi, We have a use
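A plain-Python stand-in for the pattern being recommended: datasets stay immutable, so "mutating" one really means rebinding a name to a new union of old plus new. The `Dataset` class here is a hypothetical toy, not a Spark API.

```python
class Dataset:
    """Toy immutable dataset mirroring the RDD union() pattern."""
    def __init__(self, rows):
        self.rows = tuple(rows)  # contents are fixed at construction

    def union(self, other):
        # Returns a NEW dataset; neither input is modified.
        return Dataset(self.rows + other.rows)

base = Dataset([1, 2, 3])
batch = Dataset([4, 5])
combined = base.union(batch)

print(combined.rows)  # (1, 2, 3, 4, 5)
print(base.rows)      # (1, 2, 3) -- original unchanged, as with RDD.union
```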

Re: Spark 1.1.0: Cannot load main class from JAR

2014-09-12 Thread Patrick Wendell
Hey SK, Yeah, the documented format is the same (we expect users to add the jar at the end) but the old spark-submit had a bug where it would actually accept inputs that did not match the documented format. Sorry if this was difficult to find! - Patrick On Fri, Sep 12, 2014 at 1:50 PM, SK

Re: spark-1.1.0 with make-distribution.sh problem

2014-09-14 Thread Patrick Wendell
Yeah that issue has been fixed by adding better docs, it just didn't make it in time for the release: https://github.com/apache/spark/blob/branch-1.1/make-distribution.sh#L54 On Thu, Sep 11, 2014 at 11:57 PM, Zhanfeng Huo huozhanf...@gmail.com wrote: resolved: ./make-distribution.sh --name

Re: partitioned groupBy

2014-09-16 Thread Patrick Wendell
If each partition can fit in memory, you can do this using mapPartitions and then building an inverse mapping within each partition. You'd need to construct a hash map within each partition yourself. On Tue, Sep 16, 2014 at 4:27 PM, Akshat Aranya aara...@gmail.com wrote: I have a use case where
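A sketch of the per-partition logic being suggested, written as a plain Python generator over one partition's (key, value) pairs; in PySpark the same function would be handed to mapPartitions. The example data is invented.

```python
def invert_partition(pairs):
    # Build an inverse (value -> [keys]) map within a single partition,
    # assuming the whole partition fits in memory.
    inverse = {}
    for key, value in pairs:
        inverse.setdefault(value, []).append(key)
    yield inverse  # one dict per partition

partition = iter([("a", 1), ("b", 2), ("c", 1)])
result = next(invert_partition(partition))
print(result)  # {1: ['a', 'c'], 2: ['b']}
```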

Re: partitioned groupBy

2014-09-17 Thread Patrick Wendell
...@gmail.com wrote: Patrick, If I understand this correctly, I won't be able to do this in the closure provided to mapPartitions() because that's going to be stateless, in the sense that a hash map that I create within the closure would only be useful for one call of MapPartitionsRDD.compute(). I

Re: Spot instances on Amazon EMR

2014-09-18 Thread Patrick Wendell
Hey Grzegorz, EMR is a service that is not maintained by the Spark community. So this list isn't the right place to ask EMR questions. - Patrick On Thu, Sep 18, 2014 at 3:19 AM, Grzegorz Białek grzegorz.bia...@codilime.com wrote: Hi, I would like to run Spark application on Amazon EMR. I have

Spark SQL + Hive + JobConf NoClassDefFoundError

2014-09-29 Thread Patrick McGloin
doesn't find the class. Here is the command: sudo ./spark-submit --class aac.main.SparkDriver --master spark://localhost:7077 --jars AAC-assembly-1.0.jar aacApp_2.10-1.0.jar Any pointers would be appreciated! Best regards, Patrick

Re: Spark SQL + Hive + JobConf NoClassDefFoundError

2014-10-01 Thread Patrick McGloin
FYI, in case anybody else has this problem, we switched to Spark 1.1 (outside CDH) and the same Spark application worked first time (once recompiled with Spark 1.1 libs of course). I assume this is because Spark 1.1 is compiled with Hive. On 29 September 2014 17:41, Patrick McGloin mcgloin.patr

Re: coalesce with shuffle or repartition is not necessarily fault-tolerant

2014-10-08 Thread Patrick Wendell
IIRC - the random is seeded with the index, so it will always produce the same result for the same index. Maybe I don't totally follow though. Could you give a small example of how this might change the RDD ordering in a way that you don't expect? In general repartition() will not preserve the
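The determinism claim above can be illustrated in plain Python: if the generator is seeded with the partition index, recomputing the same partition (say, after a failure) reproduces the same "random" assignments. The target-partition count of 4 is arbitrary.

```python
import random

def assign_targets(index, items):
    # Seeding with the partition index makes the assignment reproducible:
    # the same index always yields the same sequence of choices.
    rng = random.Random(index)
    return [rng.randrange(4) for _ in items]

first = assign_targets(7, range(5))
again = assign_targets(7, range(5))
assert first == again  # identical on recomputation of the same partition
```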

Re: sparksql connect remote hive cluster

2014-10-08 Thread Patrick Wendell
Spark will need to connect both to the hive metastore and to all HDFS nodes (NN and DN's). If that is all in place then it should work. In this case it looks like maybe it can't connect to a datanode in HDFS to get the raw data. Keep in mind that the performance might not be very good if you are

Re: scalac crash when compiling DataTypeConversions.scala

2014-10-23 Thread Patrick Wendell
do a mvn install first then (I think) you can test sub-modules independently: mvn test -pl streaming ... - Patrick On Wed, Oct 22, 2014 at 10:00 PM, Ryan Williams ryan.blake.willi...@gmail.com wrote: I started building Spark / running Spark tests this weekend and on maybe 5-10 occasions have run

Re: About Memory usage in the Spark UI

2014-10-23 Thread Patrick Wendell
It shows the amount of memory used to store RDD blocks, which are created when you run .cache()/.persist() on an RDD. On Wed, Oct 22, 2014 at 10:07 PM, Haopu Wang hw...@qilinsoft.com wrote: Hi, please take a look at the attached screen-shot. I wonders what's the Memory Used column mean. I

Fwd: [akka-user] Akka Camel plus Spark Streaming

2014-10-27 Thread Patrick McGloin
the following error is logged by the worker who tries to use Akka Camel: -- Forwarded message -- From: Patrick McGloin mcgloin.patr...@gmail.com Date: 24 October 2014 15:09 Subject: Re: [akka-user] Akka Camel plus Spark Streaming To: akka-u...@googlegroups.com Hi Patrik, Thanks

Re: [akka-user] Akka Camel plus Spark Streaming

2014-10-27 Thread Patrick McGloin
it is in the assembled jar file. Please see the mails below, which I sent to the Akka group for details. Is there something I am doing wrong? Is there a way to get the Akka Cluster to load the reference.conf from Camel? Any help greatly appreciated! Best regards, Patrick On 27 October 2014 11:33, Patrick

Re: Support Hive 0.13 .1 in Spark SQL

2014-10-28 Thread Patrick Wendell
/browse/SPARK-4114 This is a very important issue for Spark SQL, so I'd welcome comments on that JIRA from anyone who is familiar with Hive/HCatalog internals. - Patrick On Mon, Oct 27, 2014 at 9:54 PM, Cheng, Hao hao.ch...@intel.com wrote: Hi, all I have some PRs blocked by hive upgrading

Re: Ending a job early

2014-10-28 Thread Patrick Wendell
or two cases we've exposed functions that rely on this: https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala#L334 I would expect more robust support for online aggregation to show up in a future version of Spark. - Patrick On Tue, Oct 28

Re: Still struggling with building documentation

2014-11-11 Thread Patrick Wendell
The doc build appears to be broken in master. We'll get it patched up before the release: https://issues.apache.org/jira/browse/SPARK-4326 On Tue, Nov 11, 2014 at 10:50 AM, Alessandro Baretta alexbare...@gmail.com wrote: Nichols and Patrick, Thanks for your help, but, no, it still does

Re: Spark and Play

2014-11-11 Thread Patrick Wendell
Hi There, Because Akka versions are not binary compatible with one another, it might not be possible to integrate Play with Spark 1.1.0. - Patrick On Tue, Nov 11, 2014 at 8:21 AM, Akshat Aranya aara...@gmail.com wrote: Hi, Sorry if this has been asked before; I didn't find a satisfactory

Re: toLocalIterator in Spark 1.0.0

2014-11-13 Thread Patrick Wendell
It looks like you are trying to directly import the toLocalIterator function. You can't import functions, it should just appear as a method of an existing RDD if you have one. - Patrick On Thu, Nov 13, 2014 at 10:21 PM, Deep Pradhan pradhandeep1...@gmail.com wrote: Hi, I am using Spark 1.0.0

Question about resource sharing in Spark Standalone

2014-11-23 Thread Patrick Liu
Dear all, Currently, I am running spark standalone cluster with ~100 nodes. Multiple users can connect to the cluster by Spark-shell or PyShell. However, I can't find an efficient way to control the resources among multiple users. I can set spark.deploy.defaultCores on the server side to

Re: latest Spark 1.2 thrift server fail with NoClassDefFoundError on Guava

2014-11-26 Thread Patrick Wendell
not do this. - Patrick On Wed, Nov 26, 2014 at 1:45 AM, Judy Nash judyn...@exchange.microsoft.com wrote: Looks like a config issue. I ran spark-pi job and still failing with the same guava error Command ran: .\bin\spark-class.cmd org.apache.spark.deploy.SparkSubmit --class

Re: latest Spark 1.2 thrift server fail with NoClassDefFoundError on Guava

2014-11-26 Thread Patrick Wendell
/Preconditions.checkArgument:(ZLjava/lang/Object;)V 50: invokestatic #502// Method org/spark-project/guava/common/base/Preconditions.checkArgument:(ZLjava/lang/Object;)V On Wed, Nov 26, 2014 at 11:08 AM, Patrick Wendell pwend...@gmail.com wrote: Hi Judy, Are you somehow

Re: Opening Spark on IntelliJ IDEA

2014-11-29 Thread Patrick Wendell
I recently posted instructions on loading Spark in Intellij from scratch: https://cwiki.apache.org/confluence/display/SPARK/Useful+Developer+Tools#UsefulDeveloperTools-BuildingSparkinIntelliJIDEA You need to do a few extra steps for the YARN project to work. Also, for questions like this that

Re: latest Spark 1.2 thrift server fail with NoClassDefFoundError on Guava

2014-11-30 Thread Patrick Wendell
present it can cause issues. On Sun, Nov 30, 2014 at 10:53 PM, Judy Nash judyn...@exchange.microsoft.com wrote: Thanks Patrick and Cheng for the suggestions. The issue was Hadoop common jar was added to a classpath. After I removed Hadoop common jar from both master and slave, I was able

Re: Exception adding resource files in latest Spark

2014-12-04 Thread Patrick Wendell
Thanks for flagging this. I reverted the relevant YARN fix in Spark 1.2 release. We can try to debug this in master. On Thu, Dec 4, 2014 at 9:51 PM, Jianshi Huang jianshi.hu...@gmail.com wrote: I created a ticket for this: https://issues.apache.org/jira/browse/SPARK-4757 Jianshi On Fri,

Re: Stateful mapPartitions

2014-12-05 Thread Patrick Wendell
Yeah the main way to do this would be to have your own static cache of connections. These could be using an object in Scala or just a static variable in Java (for instance a set of connections that you can borrow from). - Patrick On Thu, Dec 4, 2014 at 5:26 PM, Tobias Pfeiffer t...@preferred.jp
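A plain-Python sketch of the static connection cache being described: the module-level `_pool` plays the role of a Scala object or Java static field, created once per worker process and shared across tasks. The connection itself is a hypothetical stand-in.

```python
# Module-level state: one pool per process, shared by every task that
# runs in that process (the analogue of a Scala object / Java static).
_pool = []

def get_connection():
    if _pool:
        return _pool.pop()  # borrow a cached connection
    return object()         # hypothetical: open a real connection here

def release_connection(conn):
    _pool.append(conn)      # return it for the next task to reuse

conn1 = get_connection()
release_connection(conn1)
conn2 = get_connection()
assert conn1 is conn2  # the second borrower reuses the cached connection
```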

Re: Spark Server - How to implement

2014-12-12 Thread Patrick Wendell
. - Patrick On Fri, Dec 12, 2014 at 10:06 AM, Manoj Samel manojsamelt...@gmail.com wrote: Thanks Marcelo. Spark Gurus/Databricks team - do you have something in roadmap for such a spark server ? Thanks, On Thu, Dec 11, 2014 at 5:43 PM, Marcelo Vanzin van...@cloudera.com wrote: Oops, sorry

Re: spark streaming kafa best practices ?

2014-12-17 Thread Patrick Wendell
to produce a side effect and map for something that will return a new dataset. On Wed, Dec 17, 2014 at 5:43 AM, Gerard Maas gerard.m...@gmail.com wrote: Patrick, I was wondering why one would choose for rdd.map vs rdd.foreach to execute a side-effecting function on an RDD. -kr, Gerard. On Sat, Dec 6
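The map-versus-foreach distinction above, sketched with plain Python collections (invented data): map builds and returns a new dataset, while a foreach-style loop runs purely for its side effect and returns nothing useful.

```python
data = [1, 2, 3]

# map: use it when you need the transformed results as a new dataset.
doubled = [x * 2 for x in data]

# foreach-style: run only for the side effect (e.g. writing to an
# external store); any return value is discarded.
seen = []
for x in data:
    seen.append(x)

print(doubled)  # [2, 4, 6]
print(seen)     # [1, 2, 3]
```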

Announcing Spark 1.2!

2014-12-19 Thread Patrick Wendell
I'm happy to announce the availability of Spark 1.2.0! Spark 1.2.0 is the third release on the API-compatible 1.X line. It is Spark's largest release ever, with contributions from 172 developers and more than 1,000 commits! This release brings operational and performance improvements in Spark

Re: Question on saveAsTextFile with overwrite option

2014-12-24 Thread Patrick Wendell
Is it sufficient to set spark.hadoop.validateOutputSpecs to false? http://spark.apache.org/docs/latest/configuration.html - Patrick On Wed, Dec 24, 2014 at 10:52 PM, Shao, Saisai saisai.s...@intel.com wrote: Hi, We have such requirements to save RDD output to HDFS with saveAsTextFile like
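For concreteness, the setting can be passed at submit time; `my_job.py` below is a hypothetical application, and disabling the check means saveAsTextFile will silently overwrite an existing output directory, so use it deliberately.

```
spark-submit --conf spark.hadoop.validateOutputSpecs=false my_job.py
```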

Re: Question on saveAsTextFile with overwrite option

2014-12-24 Thread Patrick Wendell
alternatives. This is already pretty easy IMO. - Patrick On Wed, Dec 24, 2014 at 11:28 PM, Cheng, Hao hao.ch...@intel.com wrote: I am wondering if we can provide more friendly API, other than configuration for this purpose. What do you think Patrick? Cheng Hao -Original Message- From

Re: Long-running job cleanup

2014-12-28 Thread Patrick Wendell
be referenced. If you are seeing a large build up of shuffle data, it's possible you are retaining references to older RDDs inadvertently. Could you explain what your job actually doing? - Patrick On Mon, Dec 22, 2014 at 2:36 PM, Ganelin, Ilya ilya.gane...@capitalone.com wrote: Hi all, I have a long running

Re: action progress in ipython notebook?

2014-12-28 Thread Patrick Wendell
Hey Eric, I'm just curious - which specific features in 1.2 do you find most help with usability? This is a theme we're focusing on for 1.3 as well, so it's helpful to hear what makes a difference. - Patrick On Sun, Dec 28, 2014 at 1:36 AM, Eric Friedman eric.d.fried...@gmail.com wrote: Hi

Re: Bouncing Mails

2015-01-17 Thread Patrick Wendell
Akhil, Those are handled by ASF infrastructure, not anyone in the Spark project. So this list is not the appropriate place to ask for help. - Patrick On Sat, Jan 17, 2015 at 12:56 AM, Akhil Das ak...@sigmoidanalytics.com wrote: My mails to the mailing list are getting rejected, have opened

Re: Accumulator value in Spark UI

2015-01-14 Thread Patrick Wendell
It should appear in the page for any stage in which accumulators are updated. On Wed, Jan 14, 2015 at 6:46 PM, Justin Yip yipjus...@prediction.io wrote: Hello, From accumulator documentation, it says that if the accumulator is named, it will be displayed in the WebUI. However, I cannot find

Re: Re: Sort Shuffle performance issues about using AppendOnlyMap for large data sets

2015-02-12 Thread Patrick Wendell
partition. - Patrick On Wed, Feb 11, 2015 at 9:37 PM, fightf...@163.com fightf...@163.com wrote: Hi, Really have no adequate solution got for this issue. Expecting any available analytical rules or hints. Thanks, Sun. fightf...@163.com From: fightf

Re: Questions about Spark standalone resource scheduler

2015-02-02 Thread Patrick Wendell
Hadoop stack via something like YARN. - Patrick On Mon, Feb 2, 2015 at 12:24 AM, Shao, Saisai saisai.s...@intel.com wrote: Hi all, I have some questions about the future development of Spark's standalone resource scheduler. We've heard some users have the requirements to have multi-tenant

Re: feeding DataFrames into predictive algorithms

2015-02-11 Thread Patrick Wendell
I think there is a minor error here in that the first example needs a tail after the seq: df.map { row => (row.getDouble(0), row.toSeq.tail.map(_.asInstanceOf[Double])) }.toDataFrame(label, features) On Wed, Feb 11, 2015 at 7:46 PM, Michael Armbrust mich...@databricks.com wrote: It sounds like
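A plain-Python sketch of the row split being corrected: the first column is the label and the remaining columns form the feature vector, which is why the `tail` is needed. The sample row is invented.

```python
row = (0.0, 1.5, 2.5, 3.5)  # label followed by three feature columns

# head/tail split: element 0 is the label, the rest are the features.
label, features = row[0], list(row[1:])

print(label)     # 0.0
print(features)  # [1.5, 2.5, 3.5]
```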

Re: Spark v1.2.1 failing under BigTop build in External Flume Sink (due to missing Netty library)

2015-03-05 Thread Patrick Wendell
You may need to add the -Phadoop-2.4 profile. When building or release packages for Hadoop 2.4 we use the following flags: -Phadoop-2.4 -Phive -Phive-thriftserver -Pyarn - Patrick On Thu, Mar 5, 2015 at 12:47 PM, Kelly, Jonathan jonat...@amazon.com wrote: I confirmed that this has nothing

Re: How to set per-user spark.local.dir?

2015-03-11 Thread Patrick Wendell
We don't support expressions or wildcards in that configuration. For each application, the local directories need to be constant. If you have users submitting different Spark applications, those can each set spark.local.dirs. - Patrick On Wed, Mar 11, 2015 at 12:14 AM, Jianshi Huang jianshi.hu

[ANNOUNCE] Announcing Spark 1.3!

2015-03-13 Thread Patrick Wendell
Hi All, I'm happy to announce the availability of Spark 1.3.0! Spark 1.3.0 is the fourth release on the API-compatible 1.X line. It is Spark's largest release ever, with contributions from 172 developers and more than 1,000 commits! Visit the release notes [1] to read about the new features, or

Re: SparkSQL UDTs with Ordering

2015-03-24 Thread Patrick Woody
have to be on the internal form, not the user visible form. On Tue, Mar 24, 2015 at 12:25 PM, Patrick Woody patrick.woo...@gmail.com wrote: Hey all, Currently looking into UDTs and I was wondering if it is reasonable to add the ability to define an Ordering (or if this is possible, then how

Re: 1.3 Hadoop File System problem

2015-03-24 Thread Patrick Wendell
Hey Jim, Thanks for reporting this. Can you give a small end-to-end code example that reproduces it? If so, we can definitely fix it. - Patrick On Tue, Mar 24, 2015 at 4:55 PM, Jim Carroll jimfcarr...@gmail.com wrote: I have code that works under 1.2.1 but when I upgraded to 1.3.0 it fails

Re: Is SPARK_CLASSPATH really deprecated?

2015-02-27 Thread Patrick Wendell
I think we need to just update the docs, it is a bit unclear right now. At the time, we worded it fairly sternly because we really wanted people to use --jars when we deprecated SPARK_CLASSPATH. But there are other types of deployments where there is a legitimate need to augment the classpath

Re: FW: Submitting jobs to Spark EC2 cluster remotely

2015-02-23 Thread Patrick Wendell
not starting correctly. - Patrick On Mon, Feb 23, 2015 at 1:13 AM, Oleg Shirokikh o...@solver.com wrote: Patrick, I haven't changed the configs much. I just executed ec2-script to create 1 master, 2 slaves cluster. Then I try to submit the jobs from remote machine leaving all defaults configured

Re: Facing error while extending scala class with Product interface to overcome limit of 22 fields in spark-shell

2015-02-26 Thread Patrick Varilly
into, but past 255, you run into underlying limitations of the JVM (https://issues.scala-lang.org/browse/SI-7324). Best, Patrick On Thu, Feb 26, 2015 at 11:58 AM, anamika gupta anamika.guo...@gmail.com wrote: Hi Patrick Thanks a ton for your in-depth answer. The compilation error is now

Re: Add PredictionIO to Powered by Spark

2015-02-24 Thread Patrick Wendell
Added - thanks! I trimmed it down a bit to fit our normal description length. On Mon, Jan 5, 2015 at 8:24 AM, Thomas Stone tho...@prediction.io wrote: Please can we add PredictionIO to https://cwiki.apache.org/confluence/display/SPARK/Powered+By+Spark PredictionIO http://prediction.io/

Re: Can you add Big Industries to the Powered by Spark page?

2015-02-24 Thread Patrick Wendell
I've added it, thanks! On Fri, Feb 20, 2015 at 12:22 AM, Emre Sevinc emre.sev...@gmail.com wrote: Hello, Could you please add Big Industries to the Powered by Spark page at https://cwiki.apache.org/confluence/display/SPARK/Powered+By+Spark ? Company Name: Big Industries URL:

Re: RDD resiliency -- does it keep state?

2015-03-27 Thread Patrick Wendell
reporting the result back to the driver. This means you need to make sure the side-effects are idempotent, or use some transactional locking. Spark's own output operations, such as saving to Hadoop, use such mechanisms. For instance, in the case of Hadoop it uses the OutputCommitter classes. - Patrick

Re: Spark 1.3 Source - Github and source tar does not seem to match

2015-03-27 Thread Patrick Wendell
The source code should match the Spark commit 4aaf48d46d13129f0f9bdafd771dd80fe568a7dc. Do you see any differences? On Fri, Mar 27, 2015 at 11:28 AM, Manoj Samel manojsamelt...@gmail.com wrote: While looking into a issue, I noticed that the source displayed on Github site does not matches the

Re: RDD.map does not allowed to preservesPartitioning?

2015-03-26 Thread Patrick Wendell
an iterator. - Patrick On Thu, Mar 26, 2015 at 3:07 PM, Jonathan Coveney jcove...@gmail.com wrote: This is just a deficiency of the api, imo. I agree: mapValues could definitely be a function (K, V)=V1. The option isn't set by the function, it's on the RDD. So you could look at the code and do

SparkSQL UDTs with Ordering

2015-03-24 Thread Patrick Woody
Hey all, Currently looking into UDTs and I was wondering if it is reasonable to add the ability to define an Ordering (or if this is possible, then how)? Currently it will throw an error when non-Native types are used. Thanks! -Pat

Re: DataFrame operation on parquet: GC overhead limit exceeded

2015-03-23 Thread Patrick Wendell
warnings on the executors, not the driver. Correct? - Patrick On Mon, Mar 23, 2015 at 10:21 AM, Martin Goodson mar...@skimlinks.com wrote: Have you tried to repartition() your original data to make more partitions before you aggregate? -- Martin Goodson | VP Data Science (0)20 3397 1240

Re: Configuring amount of disk space available to spark executors in mesos?

2015-04-13 Thread Patrick Wendell
that would make sense. - Patrick On Mon, Apr 13, 2015 at 8:19 AM, Jonathan Coveney jcove...@gmail.com wrote: I'm surprised that I haven't been able to find this via google, but I haven't... What is the setting that requests some amount of disk space for the executors? Maybe I'm misunderstanding

Re: Spark timeout issue

2015-04-26 Thread Patrick Wendell
Hi Deepak - please direct this to the user@ list. This list is for development of Spark itself. On Sun, Apr 26, 2015 at 12:42 PM, Deepak Gopalakrishnan dgk...@gmail.com wrote: Hello All, I'm trying to process a 3.5GB file on standalone mode using spark. I could run my spark job successfully on

Announcing Spark 1.3.1 and 1.2.2

2015-04-17 Thread Patrick Wendell
: http://spark.apache.org/releases/spark-release-1-3-1.html 1.2.2: http://spark.apache.org/releases/spark-release-1-2-2.html Comprehensive list of fixes: 1.3.1: http://s.apache.org/spark-1.3.1 1.2.2: http://s.apache.org/spark-1.2.2 Thanks to everyone who worked on these releases! - Patrick

Processing Large Images in Spark?

2015-04-06 Thread Patrick Young
-images-td6752.html Further, I'd like to have the imagery in HDFS rather than on the file system to avoid I/O bottlenecks if possible! Thanks for any ideas and advice! -Patrick

Re: Fully in-memory shuffles

2015-06-11 Thread Patrick Wendell
the job will fail if shuffle output exceeds memory. - Patrick On Wed, Jun 10, 2015 at 9:50 PM, Davies Liu dav...@databricks.com wrote: If you have enough memory, you can put the temporary work directory in tempfs (in memory file system). On Wed, Jun 10, 2015 at 8:43 PM, Corey Nolet cjno

Dynamic allocator requests -1 executors

2015-06-12 Thread Patrick Woody
Hey all, I've recently run into an issue where spark dynamicAllocation has asked for -1 executors from YARN. Unfortunately, this raises an exception that kills the executor-allocation thread and the application can't request more resources. Has anyone seen this before? It is spurious and the

Re: Fully in-memory shuffles

2015-06-10 Thread Patrick Wendell
In many cases the shuffle will actually hit the OS buffer cache and not ever touch spinning disk if it is a size that is less than memory on the machine. - Patrick On Wed, Jun 10, 2015 at 5:06 PM, Corey Nolet cjno...@gmail.com wrote: So with this... to help my understanding of Spark under

Re: Dynamic allocator requests -1 executors

2015-06-13 Thread Patrick Woody
Hey Sandy, I'll test it out on 1.4. Do you have a bug number or PR that I could reference as well? Thanks! -Pat Sent from my iPhone On Jun 13, 2015, at 11:38 AM, Sandy Ryza sandy.r...@cloudera.com wrote: Hi Patrick, I'm noticing that you're using Spark 1.3.1. We fixed a bug in dynamic

Get Spark version before starting context

2015-07-04 Thread Patrick Woody
Hey all, Is it possible to reliably get the version string of a Spark cluster prior to trying to connect via the SparkContext on the client side? Most of the errors I've seen on mismatched versions have been cryptic, so it would be helpful if I could throw an exception earlier. I know it is

Re: Get Spark version before starting context

2015-07-04 Thread Patrick Woody
To somewhat answer my own question - it looks like an empty request to the rest API will throw an error which returns the version in JSON as well. Still not ideal though. Would there be any objection to adding a simple version endpoint to the API? On Sat, Jul 4, 2015 at 4:00 PM, Patrick Woody
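The workaround described — poking the REST endpoint and reading the version out of the error reply — can be sketched with a small parser. The field name `serverSparkVersion` is an assumption based on the standalone REST submission server's error responses of that era; treat it as such and adjust for your cluster's actual payload:

```python
import json

def server_version(response_text, field="serverSparkVersion"):
    """Extract the Spark version from a REST-server JSON reply.

    The default field name is an assumption about the standalone REST
    submission server's error payload, not a documented contract.
    """
    payload = json.loads(response_text)
    version = payload.get(field)
    if version is None:
        raise ValueError("no version field in response: %r" % response_text)
    return version
```

Failing fast on a parsed version string is still friendlier than the cryptic serialization errors a client/server version mismatch produces later.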

[ANNOUNCE] Announcing Spark 1.4

2015-06-11 Thread Patrick Wendell
Hi All, I'm happy to announce the availability of Spark 1.4.0! Spark 1.4.0 is the fifth release on the API-compatible 1.X line. It is Spark's largest release ever, with contributions from 210 developers and more than 1,000 commits! A huge thanks goes to all of the individuals and organizations

Re: Extremely poor predictive performance with RF in mllib

2015-08-04 Thread Patrick Lam
-- Patrick Lam, Institute for Quantitative Social Science, Harvard University http://www.patricklam.org

Re: Twitter4J streaming question

2015-07-23 Thread Patrick McCarthy
How can I tell if it's the sample stream or full stream ? Thanks Sent from my iPhone On Jul 23, 2015, at 4:17 PM, Enno Shioji eshi...@gmail.com wrote: You are probably listening to the sample stream, and THEN filtering. This means you listen to 1% of the twitter

Re: Twitter4J streaming question

2015-07-23 Thread Patrick McCarthy
, Patrick McCarthy pmccar...@eatonvance.com wrote: How can I tell if it's the sample stream or full stream ? Thanks Sent from my iPhone On Jul 23, 2015, at 4:17 PM, Enno Shioji eshi...@gmail.com wrote: You are probably listening to the sample
