[jira] [Created] (SPARK-9923) ShuffleMapStage.numAvailableOutputs should be an Int instead of Long

2015-08-12 Thread Matei Zaharia (JIRA)
Matei Zaharia created SPARK-9923: Summary: ShuffleMapStage.numAvailableOutputs should be an Int instead of Long Key: SPARK-9923 URL: https://issues.apache.org/jira/browse/SPARK-9923 Project: Spark

[jira] [Updated] (SPARK-9923) ShuffleMapStage.numAvailableOutputs should be an Int instead of Long

2015-08-12 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-9923?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia updated SPARK-9923: - Labels: Starter (was: ) > ShuffleMapStage.numAvailableOutputs should be an Int instead of Long

[jira] [Updated] (SPARK-9850) Adaptive execution in Spark

2015-08-11 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-9850?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia updated SPARK-9850: - Issue Type: Epic (was: New Feature) > Adaptive execution in Spark

[jira] [Updated] (SPARK-9850) Adaptive execution in Spark

2015-08-11 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-9850?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia updated SPARK-9850: - Assignee: Yin Huai > Adaptive execution in Spark

[jira] [Created] (SPARK-9853) Optimize shuffle fetch of contiguous partition IDs

2015-08-11 Thread Matei Zaharia (JIRA)
Matei Zaharia created SPARK-9853: Summary: Optimize shuffle fetch of contiguous partition IDs Key: SPARK-9853 URL: https://issues.apache.org/jira/browse/SPARK-9853 Project: Spark Issue Type

[jira] [Assigned] (SPARK-9852) Let HashShuffleFetcher fetch multiple map output partitions

2015-08-11 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-9852?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia reassigned SPARK-9852: Assignee: Matei Zaharia > Let HashShuffleFetcher fetch multiple map output partitions

[jira] [Created] (SPARK-9851) Add support for submitting map stages individually in DAGScheduler

2015-08-11 Thread Matei Zaharia (JIRA)
Matei Zaharia created SPARK-9851: Summary: Add support for submitting map stages individually in DAGScheduler Key: SPARK-9851 URL: https://issues.apache.org/jira/browse/SPARK-9851 Project: Spark

[jira] [Assigned] (SPARK-9851) Add support for submitting map stages individually in DAGScheduler

2015-08-11 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-9851?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia reassigned SPARK-9851: Assignee: Matei Zaharia > Add support for submitting map stages individually in DAGScheduler

[jira] [Created] (SPARK-9852) Let HashShuffleFetcher fetch multiple map output partitions

2015-08-11 Thread Matei Zaharia (JIRA)
Matei Zaharia created SPARK-9852: Summary: Let HashShuffleFetcher fetch multiple map output partitions Key: SPARK-9852 URL: https://issues.apache.org/jira/browse/SPARK-9852 Project: Spark

[jira] [Updated] (SPARK-9850) Adaptive execution in Spark

2015-08-11 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-9850?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia updated SPARK-9850: - Attachment: AdaptiveExecutionInSpark.pdf > Adaptive execution in Spark

[jira] [Created] (SPARK-9850) Adaptive execution in Spark

2015-08-11 Thread Matei Zaharia (JIRA)
Matei Zaharia created SPARK-9850: Summary: Adaptive execution in Spark Key: SPARK-9850 URL: https://issues.apache.org/jira/browse/SPARK-9850 Project: Spark Issue Type: New Feature

[jira] [Resolved] (SPARK-9244) Increase some default memory limits

2015-07-22 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-9244?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia resolved SPARK-9244. -- Resolution: Fixed Fix Version/s: 1.5.0 > Increase some default memory limits

[jira] [Created] (SPARK-9244) Increase some default memory limits

2015-07-21 Thread Matei Zaharia (JIRA)
Matei Zaharia created SPARK-9244: Summary: Increase some default memory limits Key: SPARK-9244 URL: https://issues.apache.org/jira/browse/SPARK-9244 Project: Spark Issue Type: Improvement

Re: Make off-heap store pluggable

2015-07-20 Thread Matei Zaharia
I agree with this -- basically, to build on Reynold's point, you should be able to get almost the same performance by implementing either the Hadoop FileSystem API or the Spark Data Source API over Ignite in the right way. This would let people save data persistently in Ignite in addition to using …

Re: work around Size exceeds Integer.MAX_VALUE

2015-07-09 Thread Matei Zaharia
This means that one of your cached RDD partitions is bigger than 2 GB of data. You can fix it by having more partitions. If you read data from a file system like HDFS or S3, set the number of partitions higher in the sc.textFile, hadoopFile, etc. methods (it's an optional second parameter to those …
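The fix in this reply can be illustrated with a small back-of-the-envelope helper (plain Python, no Spark; the 2 GB figure is Java's Integer.MAX_VALUE, which caps the size of a single cached partition in these Spark versions):

```python
def min_partitions(total_bytes, limit=2**31 - 1):
    """Smallest partition count that keeps each partition under `limit`
    bytes, assuming data splits evenly across partitions."""
    return -(-total_bytes // limit)  # ceiling division

# A ~10 GiB cached RDD needs at least 6 partitions to stay under
# Integer.MAX_VALUE bytes each (5 * (2**31 - 1) falls 5 bytes short).
assert min_partitions(10 * 2**30) == 6
```

In Spark itself you would pass a larger minPartitions to sc.textFile / hadoopFile, or call repartition() on an existing RDD.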

Re: how can I write a language "wrapper"?

2015-06-23 Thread Matei Zaharia
Just FYI, it would be easiest to follow SparkR's example and add the DataFrame API first. Other APIs will be designed to work on DataFrames (most notably machine learning pipelines), and the surface of this API is much smaller than of the RDD API. This API will also give you great performance as

Welcoming some new committers

2015-06-17 Thread Matei Zaharia
Hey all, Over the past 1.5 months we added a number of new committers to the project, and I wanted to welcome them now that all of their respective forms, accounts, etc are in. Join me in welcoming the following new committers: - Davies Liu - DB Tsai - Kousuke Saruta - Sandy Ryza - Yin Huai …

Re: Spark or Storm

2015-06-17 Thread Matei Zaharia
> …understand that 1) there is no global ordering; e.g. an output operation for a batch consisting of offsets [4,5,6] can be invoked before the operation for offsets [1,2,3], and 2) if you wanted to achieve something similar to what TridentState does, you'll have to do it yourself (for …

Re: Spark or Storm

2015-06-17 Thread Matei Zaharia
This documentation is only for writes to an external system, but all the counting you do within your streaming app (e.g. if you use reduceByKeyAndWindow to keep track of a running count) is exactly-once. When you write to a storage system, no matter which streaming framework you use, you'll have

Re: Remove Hadoop 1 support (Hadoop <2.2) for Spark 1.5?

2015-06-12 Thread Matei Zaharia
I don't like the idea of removing Hadoop 1 unless it becomes a significant maintenance burden, which I don't think it is. You'll always be surprised how many people use old software, even though various companies may no longer support them. With Hadoop 2 in particular, I may be misremembering,

[jira] [Created] (SPARK-8111) SparkR shell should display Spark logo and version banner on startup

2015-06-04 Thread Matei Zaharia (JIRA)
Matei Zaharia created SPARK-8111: Summary: SparkR shell should display Spark logo and version banner on startup Key: SPARK-8111 URL: https://issues.apache.org/jira/browse/SPARK-8111 Project: Spark

[jira] [Updated] (SPARK-8110) DAG visualizations sometimes look weird in Python

2015-06-04 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-8110?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia updated SPARK-8110: - Attachment: Screen Shot 2015-06-04 at 1.51.32 PM.png Screen Shot 2015-06-04 at

[jira] [Created] (SPARK-8110) DAG visualizations sometimes look weird in Python

2015-06-04 Thread Matei Zaharia (JIRA)
Matei Zaharia created SPARK-8110: Summary: DAG visualizations sometimes look weird in Python Key: SPARK-8110 URL: https://issues.apache.org/jira/browse/SPARK-8110 Project: Spark Issue Type

Re: [VOTE] Release Apache Spark 1.4.0 (RC4)

2015-06-04 Thread Matei Zaharia
+1 Tested on Mac OS X > On Jun 4, 2015, at 1:09 PM, Patrick Wendell wrote: > > I will give +1 as well. > > On Wed, Jun 3, 2015 at 11:59 PM, Reynold Xin wrote: >> Let me give you the 1st >> >> +1 >> >> >> >> On Tue, Jun 2, 2015 at 10:47 PM, Patrick Wendell wrote: >>> >>> He all - a tiny

Re: Equivalent to Storm's 'field grouping' in Spark.

2015-06-03 Thread Matei Zaharia
This happens automatically when you use the byKey operations, e.g. reduceByKey, updateStateByKey, etc. Spark Streaming keeps the state for a given set of keys on a specific node and sends new tuples with that key to that node. Matei > On Jun 3, 2015, at 6:31 AM, allonsy wrote: > Hi everybody, …
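A minimal sketch of the hash partitioning that underlies this behavior (plain Python; Spark's HashPartitioner does the equivalent using the key's hashCode):

```python
def partition_for(key, num_partitions):
    """Hash partitioning: a given key always maps to the same partition,
    so per-key state can live on a single node -- which is what the
    *ByKey operations rely on."""
    return hash(key) % num_partitions

# Every tuple with key "user42" lands in the same partition.
first = partition_for("user42", 8)
assert all(partition_for("user42", 8) == first for _ in range(100))
```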

Re: map - reduce only with disk

2015-06-02 Thread Matei Zaharia
like they do now? > Thank you! > 2015-06-02 21:25 GMT+02:00 Matei Zaharia: > You shouldn't have to persist the RDD at all, just call flatMap and reduce on it directly. If you try to persist it, that will try to load the original …

Re: map - reduce only with disk

2015-06-02 Thread Matei Zaharia
conf.set("spark.executor.memory", "115g") > conf.set("spark.shuffle.file.buffer.kb", "1000") > > my spark-env.sh: > ulimit -n 200000 > SPARK_JAVA_OPTS="-Xss1g -Xmx129g -d64 -XX:-UseGCOverheadLimit -XX:-UseCompressedOops" > SPARK…

Re: map - reduce only with disk

2015-06-01 Thread Matei Zaharia
As long as you don't use cache(), these operations will go from disk to disk, and will only use a fixed amount of memory to build some intermediate results. However, note that because you're using groupByKey, that needs the values for each key to all fit in memory at once. In this case, if you're …
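The memory difference the reply points at can be sketched in plain Python: an incremental (reduceByKey-style) fold keeps one running value per key, while a group-by-key must hold every value for a key in memory at once:

```python
def reduce_by_key(pairs, func):
    """Incremental per-key aggregation: O(1) state per key, unlike
    grouping, which materializes the full value list for each key."""
    out = {}
    for k, v in pairs:
        out[k] = func(out[k], v) if k in out else v
    return out

pairs = [("a", 1), ("b", 2), ("a", 3), ("a", 4)]
assert reduce_by_key(pairs, lambda x, y: x + y) == {"a": 8, "b": 2}
```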

Re: Representing a recursive data type in Spark SQL

2015-05-28 Thread Matei Zaharia
Your best bet might be to use a map in SQL and make the keys be longer paths (e.g. params_param1 and params_param2). I don't think you can have a map in some of them but not in others. Matei > On May 28, 2015, at 3:48 PM, Jeremy Lucas wrote: > > Hey Reynold, > > Thanks for the suggestion. Ma

Re: Spark logo license

2015-05-19 Thread Matei Zaharia
Check out Apache's trademark guidelines here: http://www.apache.org/foundation/marks/ Matei > On May 20, 2015, at 12:02 AM, Justin Pihony wrote: > > What is the license on using the spark logo. Is it free to be used for > displaying commercially? > >

Re: Wish for 1.4: upper bound on # tasks in Mesos

2015-05-19 Thread Matei Zaharia
…overcome the limit of tasks per job :) > cheers, > Tom > On Tue, May 19, 2015 at 10:05 AM, Matei Zaharia wrote: > Hey Tom, > Are you using the fine-grained or coarse-grained scheduler? For the coarse-grained scheduler, there …

Re: Wish for 1.4: upper bound on # tasks in Mesos

2015-05-19 Thread Matei Zaharia
Hey Tom, Are you using the fine-grained or coarse-grained scheduler? For the coarse-grained scheduler, there is a spark.cores.max config setting that will limit the total # of cores it grabs. This was there in earlier versions too. Matei > On May 19, 2015, at 12:39 PM, Thomas Dudziak wrote: >

Re: SPARKTA: a real-time aggregation engine based on Spark Streaming

2015-05-14 Thread Matei Zaharia
(Sorry, for non-English people: that means it's a good thing.) Matei > On May 14, 2015, at 10:53 AM, Matei Zaharia wrote: > > ...This is madness! > >> On May 14, 2015, at 9:31 AM, dmoralesdf wrote: >> >> Hi there, >> >> We have released

Re: SPARKTA: a real-time aggregation engine based on Spark Streaming

2015-05-14 Thread Matei Zaharia
...This is madness! > On May 14, 2015, at 9:31 AM, dmoralesdf wrote: > > Hi there, > > We have released our real-time aggregation engine based on Spark Streaming. > > SPARKTA is fully open source (Apache2) > > > You can checkout the slides showed up at the Strata past week: > > http://www.s

Re: large volume spark job spends most of the time in AppendOnlyMap.changeValue

2015-05-12 Thread Matei Zaharia
It could also be that your hash function is expensive. What is the key class you have for the reduceByKey / groupByKey? Matei > On May 12, 2015, at 10:08 AM, Night Wolf wrote: > > I'm seeing a similar thing with a slightly different stack trace. Ideas? > > org.apache.spark.util.collection.App

[jira] [Resolved] (SPARK-7298) Harmonize style of new UI visualizations

2015-05-08 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-7298?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia resolved SPARK-7298. -- Resolution: Fixed Fix Version/s: 1.4.0 > Harmonize style of new UI visualizations

Re: Spark 1.3.1 / Hadoop 2.6 package has broken S3 access

2015-05-07 Thread Matei Zaharia
We should make sure to update our docs to mention s3a as well, since many people won't look at Hadoop's docs for this. Matei > On May 7, 2015, at 12:57 PM, Nicholas Chammas > wrote: > > Ah, thanks for the pointers. > > So as far as Spark is concerned, is this a breaking change? Is it possibl

[jira] [Assigned] (SPARK-7298) Harmonize style of new UI visualizations

2015-05-06 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-7298?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia reassigned SPARK-7298: Assignee: Matei Zaharia (was: Patrick Wendell) > Harmonize style of new UI visualizations

Re: Multi-Line JSON in SparkSQL

2015-05-04 Thread Matei Zaharia
I don't know whether this is common, but we might also allow another separator for JSON objects, such as two blank lines. Matei > On May 4, 2015, at 2:28 PM, Reynold Xin wrote: > > Joe - I think that's a legit and useful thing to do. Do you want to give it > a shot? > > On Mon, May 4, 2015 at
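The separator idea floated here can be sketched in plain Python (this is only the hypothetical convention from the reply, not a Spark SQL feature, and it uses a single blank line for brevity):

```python
import json

def parse_blank_line_separated(text):
    """Split a file of pretty-printed JSON objects on blank-line
    separators and parse each non-empty chunk."""
    return [json.loads(c) for c in text.split("\n\n") if c.strip()]

doc = '{\n  "a": 1\n}\n\n{\n  "a": 2\n}'
assert parse_blank_line_separated(doc) == [{"a": 1}, {"a": 2}]
```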

[jira] [Commented] (SPARK-7261) Change default log level to WARN in the REPL

2015-04-29 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-7261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14520366#comment-14520366 ] Matei Zaharia commented on SPARK-7261: -- IMO we can do this even without SPARK-

Re: Spark on Windows

2015-04-16 Thread Matei Zaharia
You could build Spark with Scala 2.11 on Mac / Linux and transfer it over to Windows. AFAIK it should build on Windows too, the only problem is that Maven might take a long time to download dependencies. What errors are you seeing? Matei > On Apr 16, 2015, at 9:23 AM, Arun Lists wrote: > > We

Re: Dataset announcement

2015-04-15 Thread Matei Zaharia
Very neat, Olivier; thanks for sharing this. Matei > On Apr 15, 2015, at 5:58 PM, Olivier Chapelle wrote: > Dear Spark users, > I would like to draw your attention to a dataset that we recently released, which is as of now the largest machine learning dataset ever released; see the following …

Re: [VOTE] Release Apache Spark 1.3.1 (RC2)

2015-04-08 Thread Matei Zaharia
+1. Tested on Mac OS X and verified that some of the bugs were fixed. Matei > On Apr 8, 2015, at 7:13 AM, Sean Owen wrote: > > Still a +1 from me; same result (except that now of course the > UISeleniumSuite test does not fail) > > On Wed, Apr 8, 2015 at 1:46 AM, Patrick Wendell wrote: >> Ple

[jira] [Created] (SPARK-6778) SQL contexts in spark-shell and pyspark should both be called sqlContext

2015-04-08 Thread Matei Zaharia (JIRA)
Matei Zaharia created SPARK-6778: Summary: SQL contexts in spark-shell and pyspark should both be called sqlContext Key: SPARK-6778 URL: https://issues.apache.org/jira/browse/SPARK-6778 Project

Re: Contributor CLAs

2015-04-07 Thread Matei Zaharia
You do actually sign a CLA when you become a committer, and in general, we should ask for CLAs from anyone who contributes a large piece of code. This is the individual CLA: https://www.apache.org/licenses/icla.txt. Some people have sent them proactively because their employer asks them to. Matei

[jira] [Commented] (SPARK-6646) Spark 2.0: Rearchitecting Spark for Mobile Platforms

2015-04-01 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-6646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14391456#comment-14391456 ] Matei Zaharia commented on SPARK-6646: -- Not to rain on the parade here, but I w

Re: Experience using binary packages on various Hadoop distros

2015-03-24 Thread Matei Zaharia
Just a note, one challenge with the BYOH version might be that users who download that can't run in local mode without also having Hadoop. But if we describe it correctly then hopefully it's okay. Matei > On Mar 24, 2015, at 3:05 PM, Patrick Wendell wrote: > > Hey All, > > For a while we've

Re: IPyhon notebook command for spark need to be updated?

2015-03-20 Thread Matei Zaharia
Feel free to send a pull request to fix the doc (or say which versions it's needed in). Matei > On Mar 20, 2015, at 6:49 PM, Krishna Sankar wrote: > > Yep the command-option is gone. No big deal, just add the '%pylab inline' > command as part of your notebook. > Cheers > > > On Fri, Mar 20,

Re: Querying JSON in Spark SQL

2015-03-16 Thread Matei Zaharia
The programming guide has a short example: http://spark.apache.org/docs/latest/sql-programming-guide.html#json-datasets . Note that once you infer a schema for a JSON dataset, you can also use nested path notation (e.g. …

[jira] [Commented] (SPARK-1564) Add JavaScript into Javadoc to turn ::Experimental:: and such into badges

2015-03-12 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-1564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14359017#comment-14359017 ] Matei Zaharia commented on SPARK-1564: -- This is still a valid issue AFAIK, isn't …

Re: Release Scala version vs Hadoop version (was: [VOTE] Release Apache Spark 1.3.0 (RC3))

2015-03-08 Thread Matei Zaharia
…go through a distro instead of get bits from Spark? Different conversation but I think this sort of effect does not end up being a negative. > Well anyway, I like the idea of seeing how far Hadoop-provided releases can help. It might kill several birds with one stone. …

Re: Release Scala version vs Hadoop version (was: [VOTE] Release Apache Spark 1.3.0 (RC3))

2015-03-08 Thread Matei Zaharia
…cross building for Hadoop versions, then it is more tenable to cross build for Scala versions without exploding the number of binaries. > - Patrick > On Sun, Mar 8, 2015 at 12:46 PM, Sean Owen wrote: …

Re: [VOTE] Release Apache Spark 1.3.0 (RC3)

2015-03-08 Thread Matei Zaharia
+1 Tested it on Mac OS X. One small issue I noticed is that the Scala 2.11 build is using Hadoop 1 without Hive, which is kind of weird because people will more likely want Hadoop 2 with Hive. So it would be good to publish a build for that configuration instead. We can do it if we do a new RC

Re: Berlin Apache Spark Meetup

2015-02-17 Thread Matei Zaharia
Thanks! I've added you. Matei > On Feb 17, 2015, at 4:06 PM, Ralph Bergmann | the4thFloor.eu > wrote: > > Hi, > > > there is a small Spark Meetup group in Berlin, Germany :-) > http://www.meetup.com/Berlin-Apache-Spark-Meetup/ > > Plaes add this group to the Meetups list at > https://spark.

Re: renaming SchemaRDD -> DataFrame

2015-02-10 Thread Matei Zaharia
…will show DataFrame as the type. Matei …

Re: Powered by Spark: Concur

2015-02-09 Thread Matei Zaharia
Thanks Denny; added you. Matei > On Feb 9, 2015, at 10:11 PM, Denny Lee wrote: > > Forgot to add Concur to the "Powered by Spark" wiki: > > Concur > https://www.concur.com > Spark SQL, MLLib > Using Spark for travel and expenses analytics and personalization > > Thanks! > Denny

Re: [VOTE] Release Apache Spark 1.2.1 (RC3)

2015-02-08 Thread Matei Zaharia
+1 Tested on Mac OS X. Matei > On Feb 2, 2015, at 8:57 PM, Patrick Wendell wrote: > > Please vote on releasing the following candidate as Apache Spark version > 1.2.1! > > The tag to be voted on is v1.2.1-rc3 (commit b6eaf77): > https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h

Re: Beginner in Spark

2015-02-06 Thread Matei Zaharia
You don't need HDFS or virtual machines to run Spark. You can just download it, unzip it and run it on your laptop. See http://spark.apache.org/docs/latest/index.html . Matei > On Feb 6, 2015, at 2:58 PM, David Fallside wrote: > King, consider …

[jira] [Commented] (SPARK-5654) Integrate SparkR into Apache Spark

2015-02-06 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-5654?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14309782#comment-14309782 ] Matei Zaharia commented on SPARK-5654: -- Yup, there's a tradeoff, but given

[jira] [Resolved] (SPARK-5608) Improve SEO of Spark documentation site to let Google find latest docs

2015-02-05 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-5608?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia resolved SPARK-5608. -- Resolution: Fixed Fix Version/s: 1.3.0 > Improve SEO of Spark documentation site to let Google find latest docs

[jira] [Created] (SPARK-5608) Improve SEO of Spark documentation site to let Google find latest docs

2015-02-04 Thread Matei Zaharia (JIRA)
Matei Zaharia created SPARK-5608: Summary: Improve SEO of Spark documentation site to let Google find latest docs Key: SPARK-5608 URL: https://issues.apache.org/jira/browse/SPARK-5608 Project: Spark

Welcoming three new committers

2015-02-03 Thread Matei Zaharia
Hi all, The PMC recently voted to add three new committers: Cheng Lian, Joseph Bradley and Sean Owen. All three have been major contributors to Spark in the past year: Cheng on Spark SQL, Joseph on MLlib, and Sean on ML and many pieces throughout Spark Core. Join me in welcoming them as committers …

Re: [VOTE] Release Apache Spark 1.2.1 (RC2)

2015-01-31 Thread Matei Zaharia
This looks like a pretty serious problem, thanks! Glad people are testing on Windows. Matei > On Jan 31, 2015, at 11:57 AM, MartinWeindel wrote: > > FYI: Spark 1.2.1rc2 does not work on Windows! > > On creating a Spark context you get following log output on my Windows > machine: > INFO org.

Re: Why must the dstream.foreachRDD(...) parameter be serializable?

2015-01-27 Thread Matei Zaharia
I believe this is needed for driver recovery in Spark Streaming. If your Spark driver program crashes, Spark Streaming can recover the application by reading the set of DStreams and output operations from a checkpoint file (see https://spark.apache.org/docs/latest/streaming-programming-guide.htm

Re: renaming SchemaRDD -> DataFrame

2015-01-27 Thread Matei Zaharia
> …Even if SchemaRDD needs to rely on Spark SQL under the covers, it would be more clear from a user-facing perspective to at least choose a package name for it that omits "sql". …

Re: renaming SchemaRDD -> DataFrame

2015-01-26 Thread Matei Zaharia
While it might be possible to move this concept to Spark Core long-term, supporting structured data efficiently does require quite a bit of the infrastructure in Spark SQL, such as query planning and columnar storage. The intent of Spark SQL though is to be more than a SQL server -- it's meant to …

Re: renaming SchemaRDD -> DataFrame

2015-01-26 Thread Matei Zaharia
(Actually when we designed Spark SQL we thought of giving it another name, like Spark Schema, but we decided to stick with SQL since that was the most obvious use case to many users.) Matei > On Jan 26, 2015, at 5:31 PM, Matei Zaharia wrote: > > While it might be possible to

Re: Spark performance gains for small queries

2015-01-23 Thread Matei Zaharia
It's hard to tell without more details, but the start-up latency in Hive can sometimes be high, especially if you are running Hive on MapReduce. MR just takes 20-30 seconds per job to spin up even if the job is doing nothing. For real use of Spark SQL for short queries by the way, I'd recommend

Re: Semantics of LGTM

2015-01-17 Thread Matei Zaharia
+1 on this. > On Jan 17, 2015, at 6:16 PM, Reza Zadeh wrote: > > LGTM > > On Sat, Jan 17, 2015 at 5:40 PM, Patrick Wendell wrote: > >> Hey All, >> >> Just wanted to ping about a minor issue - but one that ends up having >> consequence given Spark's volume of reviews and commits. As much as >

Re: Spark UI and Spark Version on Google Compute Engine

2015-01-17 Thread Matei Zaharia
Unfortunately we don't have anything to do with Spark on GCE, so I'd suggest asking in the GCE support forum. You could also try to launch a Spark cluster by hand on nodes in there. Sigmoid Analytics published a package for this here: http://spark-packages.org/package/9 Matei > On Jan 17, 2015

Re: spark 1.2 compatibility

2015-01-16 Thread Matei Zaharia
The Apache Spark project should work with it, but I'm not sure you can get support from HDP (if you have that). Matei > On Jan 16, 2015, at 5:36 PM, Judy Nash > wrote: > > Should clarify on this. I personally have used HDP 2.1 + Spark 1.2 and have > not seen a problem. > > However official

Re: Spark's equivalent of ShellBolt

2015-01-14 Thread Matei Zaharia
You can use the pipe() function on RDDs to call external code. It passes data to an external program through stdin / stdout. For Spark Streaming, you would do dstream.transform(rdd => rdd.pipe(...)) to call it on each RDD. Matei > On Jan 14, 2015, at 8:41 PM, Umanga Bista wrote: > This i…
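What pipe() does per partition can be sketched with subprocess (plain Python; the child command here is a stand-in for whatever external program you would call):

```python
import subprocess
import sys

def pipe_partition(lines, argv):
    """Mimic RDD.pipe(): feed each element to an external process on
    stdin, one per line, and collect its stdout lines."""
    proc = subprocess.run(argv, input="\n".join(lines),
                          capture_output=True, text=True, check=True)
    return proc.stdout.splitlines()

# Pipe through a child Python that upper-cases each line.
child = [sys.executable, "-c",
         "import sys\nfor l in sys.stdin: print(l.strip().upper())"]
assert pipe_partition(["spark", "storm"], child) == ["SPARK", "STORM"]
```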

Re: SciSpark: NASA AIST14 proposal

2015-01-14 Thread Matei Zaharia
Yeah, very cool! You may also want to check out https://issues.apache.org/jira/browse/SPARK-5097 as something to build upon for these operations. Matei > On Jan 14, 2015, at 6:18 PM, Reynold Xin wrote: > > Chris, > > This is really cool. Congratulations and thanks for sharing the news. > >

[jira] [Updated] (SPARK-5088) Use spark-class for running executors directly on mesos

2015-01-13 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-5088?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia updated SPARK-5088: - Target Version/s: 1.3.0 (was: 1.3.0, 1.2.1) > Use spark-class for running executors directly on mesos

[jira] [Updated] (SPARK-5088) Use spark-class for running executors directly on mesos

2015-01-13 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-5088?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia updated SPARK-5088: - Fix Version/s: (was: 1.2.1) > Use spark-class for running executors directly on mesos

Re: Pattern Matching / Equals on Case Classes in Spark Not Working

2015-01-12 Thread Matei Zaharia
Is this in the Spark shell? Case classes don't work correctly in the Spark shell unfortunately (though they do work in the Scala shell) because we change the way lines of code compile to allow shipping functions across the network. The best way to get case classes in there is to compile them into …

[jira] [Resolved] (SPARK-3619) Upgrade to Mesos 0.21 to work around MESOS-1688

2015-01-09 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-3619?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia resolved SPARK-3619. -- Resolution: Fixed Fix Version/s: 1.3.0 Assignee: Jongyoul Lee (was: Timothy

Fwd: ApacheCon North America 2015 Call For Papers

2015-01-05 Thread Matei Zaharia
FYI, ApacheCon North America call for papers is up. Matei > Begin forwarded message: > > Date: January 5, 2015 at 9:40:41 AM PST > From: Rich Bowen > Reply-To: dev > To: dev > Subject: ApacheCon North America 2015 Call For Papers > > Fellow ASF enthusiasts, > > We now have less than a month

Re: JetS3T settings spark

2014-12-30 Thread Matei Zaharia
This file needs to be on your CLASSPATH actually, not just in a directory. The best way to pass it in is probably to package it into your application JAR. You can put it in src/main/resources in a Maven or SBT project, and check that it makes it into the JAR using jar tf yourfile.jar. Matei >

[jira] [Commented] (SPARK-4660) JavaSerializer uses wrong classloader

2014-12-29 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-4660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14260544#comment-14260544 ] Matei Zaharia commented on SPARK-4660: -- [~pkolaczk] mind sending a pull request …

Re: action progress in ipython notebook?

2014-12-29 Thread Matei Zaharia
Hey Eric, sounds like you are running into several issues, but thanks for reporting them. Just to comment on a few of these: > I'm not seeing RDDs or SRDDs cached in the Spark UI. That page remains empty > despite my calling cache(). This is expected until you compute the RDDs the first time a

Re: How to become spark developer in jira?

2014-12-29 Thread Matei Zaharia
Please ask someone else to assign them for now, and just comment on them that you're working on them. Over time if you contribute a bunch we'll add you to that list. The problem is that in the past, people would assign issues to themselves and never actually work on them, making it confusing for

Re: When will spark 1.2 released?

2014-12-18 Thread Matei Zaharia
Yup, as he posted before, "An Apache infrastructure issue prevented me from pushing this last night. The issue was resolved today and I should be able to push the final release artifacts tonight." > On Dec 18, 2014, at 10:14 PM, Andrew Ash wrote: > Patrick is working on the release as we speak …

Re: wordcount job slow while input from NFS mount

2014-12-17 Thread Matei Zaharia
…environment and the NFS server is running on the same server that Spark is running on. So basically I mount the NFS on the same bare metal machine. > Larry > On Wed, Dec 17, 2014 at 11:42 AM, Matei Zaharia wrote: > The problem is very likely NFS …

Re: wordcount job slow while input from NFS mount

2014-12-17 Thread Matei Zaharia
The problem is very likely NFS, not Spark. What kind of network is it mounted over? You can also test the performance of your NFS by copying a file from it to a local disk or to /dev/null and seeing how many bytes per second it can copy. Matei > On Dec 17, 2014, at 9:38 AM, Larryliu wrote: >
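The suggested sanity check — read the file sequentially and see how many bytes per second the mount delivers — can be sketched in Python (the NFS path below is hypothetical):

```python
import time

def read_throughput_mb_s(path, chunk=1 << 20):
    """Read `path` sequentially in 1 MiB chunks and return throughput in
    MB/s, to check whether the mount (rather than Spark) is the bottleneck."""
    start, total = time.perf_counter(), 0
    with open(path, "rb") as f:
        while True:
            buf = f.read(chunk)
            if not buf:
                break
            total += len(buf)
    elapsed = max(time.perf_counter() - start, 1e-9)
    return total / (1 << 20) / elapsed

# e.g. read_throughput_mb_s("/mnt/nfs/somefile")  # path is hypothetical
```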

Re: Spark Web Site

2014-12-15 Thread Matei Zaharia
It's just Bootstrap checked into SVN and built using Jekyll. You can check out the raw source files from SVN from https://svn.apache.org/repos/asf/spark. IMO it's fine if you guys use the layout, but just make sure it doesn't look exactly the same because otherwise both sites will look like they

Re: Spark SQL Roadmap?

2014-12-13 Thread Matei Zaharia
Spark SQL is already available, the reason for the "alpha component" label is that we are still tweaking some of the APIs so we have not yet guaranteed API stability for it. However, that is likely to happen soon (possibly 1.3). One of the major things added in Spark 1.2 was an external data source …

[jira] [Commented] (SPARK-3247) Improved support for external data sources

2014-12-11 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-3247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14243253#comment-14243253 ] Matei Zaharia commented on SPARK-3247: -- For those looking to learn about

Re: what is the best way to implement mini batches?

2014-12-11 Thread Matei Zaharia
You can just do mapPartitions on the whole RDD, and then call sliding() on the iterator in each one to get a sliding window. One problem is that you will not be able to slide "forward" into the next partition at partition boundaries. If this matters to you, you need to do something more complicated …
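The per-partition half of this recipe can be sketched in plain Python (inside Spark you would pass a function like this to mapPartitions; note that windows stop at the partition boundary, which is exactly the caveat the reply raises):

```python
from itertools import islice

def sliding(iterator, size):
    """Yield sliding windows of `size` elements over one partition's
    iterator. Windows never cross into the next partition."""
    window = list(islice(iterator, size))
    while len(window) == size:
        yield list(window)
        nxt = next(iterator, None)  # sentinel: data must not contain None
        if nxt is None:
            break
        window = window[1:] + [nxt]

assert list(sliding(iter(range(5)), 3)) == [[0, 1, 2], [1, 2, 3], [2, 3, 4]]
```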

Re: [VOTE] Release Apache Spark 1.2.0 (RC2)

2014-12-10 Thread Matei Zaharia
+1 Tested on Mac OS X. Matei > On Dec 10, 2014, at 1:08 PM, Patrick Wendell wrote: > > Please vote on releasing the following candidate as Apache Spark version > 1.2.0! > > The tag to be voted on is v1.2.0-rc2 (commit a428c446e2): > https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commi

Re: dockerized spark executor on mesos?

2014-12-03 Thread Matei Zaharia
I'd suggest asking about this on the Mesos list (CCed). As far as I know, there was actually some ongoing work for this. Matei > On Dec 3, 2014, at 9:46 AM, Dick Davies wrote: > > Just wondered if anyone had managed to start spark > jobs on mesos wrapped in a docker container? > > At present

[jira] [Closed] (SPARK-4690) AppendOnlyMap seems not using Quadratic probing as the JavaDoc

2014-12-03 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-4690?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia closed SPARK-4690. Resolution: Invalid > AppendOnlyMap seems not using Quadratic probing as the JavaDoc

[jira] [Commented] (SPARK-4690) AppendOnlyMap seems not using Quadratic probing as the JavaDoc

2014-12-03 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-4690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1429#comment-1429 ] Matei Zaharia commented on SPARK-4690: -- Yup, that's the definition …

Re: [VOTE] Release Apache Spark 1.2.0 (RC1)

2014-12-01 Thread Matei Zaharia
+0.9 from me. Tested it on Mac and Windows (someone has to do it) and while things work, I noticed a few recent scripts don't have Windows equivalents, namely https://issues.apache.org/jira/browse/SPARK-4683 and https://issues.apache.org/jira/browse/SPARK-4684. The first one at least would be good …

[jira] [Updated] (SPARK-4685) Update JavaDoc settings to include spark.ml and all spark.mllib subpackages in the right sections

2014-12-01 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-4685?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia updated SPARK-4685: - Target Version/s: 1.2.1 (was: 1.2.0) > Update JavaDoc settings to include spark.ml and all spark.mllib subpackages in the right sections

[jira] [Updated] (SPARK-4685) Update JavaDoc settings to include spark.ml and all spark.mllib subpackages in the right sections

2014-12-01 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-4685?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia updated SPARK-4685: - Priority: Trivial (was: Major) > Update JavaDoc settings to include spark.ml and all spark.mllib subpackages in the right sections

[jira] [Created] (SPARK-4685) Update JavaDoc settings to include spark.ml and all spark.mllib subpackages in the right sections

2014-12-01 Thread Matei Zaharia (JIRA)
Matei Zaharia created SPARK-4685: Summary: Update JavaDoc settings to include spark.ml and all spark.mllib subpackages in the right sections Key: SPARK-4685 URL: https://issues.apache.org/jira/browse/SPARK-4685

[jira] [Created] (SPARK-4684) Add a script to run JDBC server on Windows

2014-12-01 Thread Matei Zaharia (JIRA)
Matei Zaharia created SPARK-4684: Summary: Add a script to run JDBC server on Windows Key: SPARK-4684 URL: https://issues.apache.org/jira/browse/SPARK-4684 Project: Spark Issue Type: New
