Re: PySpark API divergence + improving pandas interoperability

2016-03-21 Thread Reynold Xin
Hi Wes, Thanks for the email. It is difficult to generalize without seeing a lot more cases, but the boolean issue is simply a query analysis rule. I can see us having a config option that changes analysis to behave more like Python/R, which changes the behavior of implicit type coercion and

[discuss] making SparkEnv private in Spark 2.0

2016-03-20 Thread Reynold Xin
Any objections? Please articulate your use case. SparkEnv is a weird one because it was documented as "private" but not marked as such in class visibility. * NOTE: This is not intended for external use. This is exposed for Shark and may be made private * in a future release. I do see Hive

Re: [discuss] making SparkEnv private in Spark 2.0

2016-03-20 Thread Reynold Xin
On Wed, Mar 16, 2016 at 3:29 PM, Mridul Muralidharan wrote: > b) Shuffle manager (to get shuffle reader) > What's the use case for shuffle manager/reader? This seems like using super internal APIs in applications.

Re: graceful shutdown in external data sources

2016-03-19 Thread Reynold Xin
no longer have any tasks? > It seems to me there is no timeout which is appropriate that is long enough > to ensure that no more tasks will be scheduled on the executor, and short > enough to be appropriate to wait on during an interactive shell shutdown. > > - Dan > > On Wed,

Re: graceful shutdown in external data sources

2016-03-19 Thread Reynold Xin
Maybe just add a watchdog thread and close the connection upon some timeout? On Wednesday, March 16, 2016, Dan Burkert wrote: > Hi all, > > I'm working on the Spark connector for Apache Kudu, and I've run into an > issue that is a bit beyond my Spark knowledge. The Kudu

Re: pull request template

2016-03-19 Thread Reynold Xin
bit so that it just has >> > instructions prepended with some character, and have those lines >> > removed by the merge_spark_pr.py script? We could then even throw in a >> > link to the wiki as Sean suggested since it won't end up in the final >> > commit m

Re: SPARK-13843 and future of streaming backends

2016-03-19 Thread Reynold Xin
Thanks for initiating this discussion. I merged the pull request because it was unblocking another major piece of work for Spark 2.0: not requiring assembly jars, which is arguably a lot more important than sources that are less frequently used. I take full responsibility for that. I think it's

Re: df.dtypes -> pyspark.sql.types

2016-03-18 Thread Reynold Xin
We probably should have the alias. Is this still a problem on master branch? On Wed, Mar 16, 2016 at 9:40 AM, Ruslan Dautkhanov wrote: > Running following: > > #fix schema for gaid which should not be Double >> from pyspark.sql.types import * >> customSchema = StructType()

Re: Accessing SparkConf in metrics sink

2016-03-16 Thread Reynold Xin
SparkConf is not a singleton. However, SparkContext in almost all cases is. So you can use SparkContext.getOrCreate().getConf On Wed, Mar 16, 2016 at 12:38 AM, Pete Robbins wrote: > I'm writing a metrics sink and reporter to push metrics to Elasticsearch. > An example
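
The reply points at `SparkContext.getOrCreate().getConf` as the way for a sink to reach the active configuration. As a hedged illustration of the get-or-create singleton pattern behind that API (plain Python, no Spark dependency; the `Context` class and its conf dict are stand-ins, not Spark code):

```python
import threading

class Context:
    """Minimal stand-in for a SparkContext-like object holding a conf dict."""
    _active = None
    _lock = threading.Lock()

    def __init__(self, conf=None):
        self.conf = dict(conf or {})

    @classmethod
    def get_or_create(cls, conf=None):
        # Return the active context if one exists; otherwise create one.
        # Later callers reach the same instance (and thus the same conf)
        # without having it plumbed through to them explicitly.
        with cls._lock:
            if cls._active is None:
                cls._active = cls(conf)
            return cls._active

ctx = Context.get_or_create({"spark.app.name": "demo"})
same = Context.get_or_create()  # a metrics sink would call this
```

The design point is that the sink never needs a reference handed to it: as long as a context exists somewhere in the process, `get_or_create` finds it.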

Re: spark 2.0 logging binary incompatibility

2016-03-15 Thread Reynold Xin
Yea we are going to tighten a lot of classes' visibility. A lot of APIs were made experimental, developer, or public for no good reason in the past. Many of them (not Logging in this case) are tied to the internal implementation of Spark at a specific time, and no longer make sense given the

Re: Various forks

2016-03-15 Thread Reynold Xin
+Xiangrui On Tue, Mar 15, 2016 at 10:24 AM, Sean Owen wrote: > Picking up this old thread, since we have the same problem updating to > Scala 2.11.8 > > https://github.com/apache/spark/pull/11681#issuecomment-196932777 > > We can see the org.spark-project packages here: > >

Re: dataframe.groupby.agg vs sql("select from groupby)")

2016-03-10 Thread Reynold Xin
They should be identical. Can you paste the detailed explain output? On Thursday, March 10, 2016, FangFang Chen wrote: > hi, > Based on my testing, the memory cost is very different for > 1. sql("select * from ...").groupby.agg > 2. sql("select ... From ... Groupby

Re: Inconsistent file extensions and omitting file extensions written by CSV, TEXT and JSON data sources.

2016-03-08 Thread Reynold Xin
Isn't this just specified by the user? On Tue, Mar 8, 2016 at 9:49 PM, Hyukjin Kwon wrote: > Hi all, > > Currently, the output from CSV, TEXT and JSON data sources does not have > file extensions such as .csv, .txt and .json (except for compression > extensions such as

Re: Spark Scheduler creating Straggler Node

2016-03-08 Thread Reynold Xin
You just want to be able to replicate hot cached blocks right? On Tuesday, March 8, 2016, Prabhu Joseph wrote: > Hi All, > > When a Spark Job is running, and one of the Spark Executor on Node A > has some partitions cached. Later for some other stage, Scheduler

Re: BUILD FAILURE due to...Unable to find configuration file at location dev/scalastyle-config.xml

2016-03-07 Thread Reynold Xin
+Sean, who was playing with this. On Mon, Mar 7, 2016 at 11:38 PM, Jacek Laskowski wrote: > Hi, > > Got the BUILD FAILURE. Anyone looking into it? > > ➜ spark git:(master) ✗ ./build/mvn -Pyarn -Phadoop-2.6 > -Dhadoop.version=2.7.2 -Phive -Phive-thriftserver -DskipTests

Re: More Robust DataSource Parameters

2016-03-07 Thread Reynold Xin
ouild be greatly > appreciated). > > With the above answer to #1 and contingent on finding a solution to the > API stability part of it, would you be supportive of a change to do this? > If so, I'll submit a JIRA first and solicit/brainstorm some ideas on how to > do #2

Re: Dynamic allocation availability on standalone mode. Misleading doc.

2016-03-07 Thread Reynold Xin
The doc fix was merged in 1.6.1, so it will get updated automatically once we push the 1.6.1 docs. On Mon, Mar 7, 2016 at 5:40 PM, Saisai Shao wrote: > Yes, we need to fix the document. > > On Tue, Mar 8, 2016 at 9:07 AM, Mark Hamstra > wrote:

Re: Typo in community databricks cloud docs

2016-03-07 Thread Reynold Xin
Thanks - I've fixed it and it will go out next time we update. For future reference, you can email directly supp...@databricks.com for this. Again - thanks for reporting this. On Sat, Mar 5, 2016 at 4:23 PM, Eugene Morozov wrote: > Hi, I'm not sure where to put

Re: [VOTE] Release Apache Spark 1.6.1 (RC1)

2016-03-07 Thread Reynold Xin
+1 (binding) On Sun, Mar 6, 2016 at 12:08 PM, Egor Pahomov wrote: > +1 > > Spark ODBC server is fine, SQL is fine. > > 2016-03-03 12:09 GMT-08:00 Yin Yang : > >> Skipping docker tests, the rest are green: >> >> [INFO] Spark Project External Kafka

Re: getting a list of executors for use in getPreferredLocations

2016-03-03 Thread Reynold Xin
What do you mean by consistent? Throughout the life cycle of an app, the executors can come and go, and as a result there really is no consistency. Do you just need it for a specific job? On Thu, Mar 3, 2016 at 3:08 PM, Cody Koeninger wrote: > I need getPreferredLocations to

Re: SPARK-SQL: Pattern Detection on Live Event or Archived Event Data

2016-03-02 Thread Reynold Xin
SQL is very common and even some business analysts learn them. Scala and Python are great, but the easiest language to use is often the languages a user already knows. And for a lot of users, that is SQL. On Wednesday, March 2, 2016, Jerry Lam wrote: > Hi guys, > > FYI...

Re: [Proposal] Enabling time series analysis on spark metrics

2016-03-01 Thread Reynold Xin
Is the suggestion just to use a different config (and maybe fallback to appid) in order to publish metrics? Seems reasonable. On Tue, Mar 1, 2016 at 8:17 AM, Karan Kumar wrote: > +dev mailing list > > Time series analysis on metrics becomes quite useful when running

Re: SPARK-SQL: Pattern Detection on Live Event or Archived Event Data

2016-03-01 Thread Reynold Xin
There are definitely pros and cons for Scala vs SQL-style CEP. Scala might be more powerful, but the target audience is very different. How much usage is there for a CEP style SQL syntax in practice? I've never seen it coming up so far. On Tue, Mar 1, 2016 at 9:35 AM, Alex Kozlov

Re: Is spark.driver.maxResultSize used correctly ?

2016-03-01 Thread Reynold Xin
data skew might be possible, but not the common case. I think we should > design for the common case, for the skew case, we may can set some > parameter of fraction to allow user to tune it. > > On Sat, Feb 27, 2016 at 4:51 PM, Reynold Xin <r...@databricks.com > <javascript:_e(%7B%

Re: Spark performance comparison for research

2016-02-29 Thread Reynold Xin
That seems reasonable, but it seems pretty unfair to the HPC setup in which the master is reading all the data. Basically you can make HPC perform infinitely worse by just adding more nodes to Spark. On Monday, February 29, 2016, yasincelik wrote: > Hello, > > I am

Re: Is spark.driver.maxResultSize used correctly ?

2016-02-27 Thread Reynold Xin
But sometimes you might have skew and almost all the result data are in one or a few tasks though. On Friday, February 26, 2016, Jeff Zhang wrote: > > My job get this exception very easily even when I set large value of > spark.driver.maxResultSize. After checking the spark

Re: some joins stopped working with spark 2.0.0 SNAPSHOT

2016-02-27 Thread Reynold Xin
Can you file a JIRA ticket? On Friday, February 26, 2016, Koert Kuipers wrote: > dataframe df1: > schema: > StructType(StructField(x,IntegerType,true)) > explain: > == Physical Plan == > MapPartitions , obj#135: object, [if (input[0, > object].isNullAt) null else input[0,

Re: More Robust DataSource Parameters

2016-02-26 Thread Reynold Xin
Thanks for the email. This sounds great in theory, but might run into two major problems: 1. Need to support 4+ programming languages (SQL, Python, Java, Scala) 2. API stability (both backward and forward) On Fri, Feb 26, 2016 at 8:44 AM, Hamel Kothari wrote: > Hi

Re: [discuss] DataFrame vs Dataset in Spark 2.0

2016-02-26 Thread Reynold Xin
pache.spark.sql.Dataset[Int] = [value: int] > > > > scala> ds.schema.json > > res17: String = > {"type":"struct","fields":[{"name":"value","type":"integer","nullable":false,"metadata":{}}]} &g

External dependencies in public APIs (was previously: Upgrading to Kafka 0.9.x)

2016-02-26 Thread Reynold Xin
Dropping Kafka list since this is about a slightly different topic. Every time we have exposed the API of a 3rd party application as a public Spark API, it has caused some problems down the road. This goes from Hadoop, Tachyon, Kafka, to Guava. Most of these are used for input/output. The good thing is

Re: how about a custom coalesce() policy?

2016-02-26 Thread Reynold Xin
Using the right email for Nezih On Fri, Feb 26, 2016 at 12:01 AM, Reynold Xin <r...@databricks.com> wrote: > I think this can be useful. > > The only thing is that we are slowly migrating to the Dataset/DataFrame > API, and leave RDD mostly as is as a lower level API. Maybe w

Re: how about a custom coalesce() policy?

2016-02-26 Thread Reynold Xin
I think this can be useful. The only thing is that we are slowly migrating to the Dataset/DataFrame API, and leave RDD mostly as is as a lower level API. Maybe we should do both? In either case it would be great to discuss the API on a pull request. Cheers. On Wed, Feb 24, 2016 at 2:08 PM, Nezih

Re: [discuss] DataFrame vs Dataset in Spark 2.0

2016-02-25 Thread Reynold Xin
ight subtle difference between > DataFrame and Dataset[Row]? For example, > > Dataset[T] joinWith Dataset[U] produces Dataset[(T, U)] > > So, > > Dataset[Row] joinWith Dataset[Row] produces Dataset[(Row, Row)] > > > > While > > DataFrame join DataFrame is still D

Re: [discuss] DataFrame vs Dataset in Spark 2.0

2016-02-25 Thread Reynold Xin
mean that the concept of DataFrame ceases to exist from a > java perspective, and they will have to refer to Dataset? > > On Thu, Feb 25, 2016 at 6:23 PM, Reynold Xin <r...@databricks.com> wrote: > >> When we first introduced Dataset in 1.6 as an experimental API, we wanted >>

Re: [discuss] DataFrame vs Dataset in Spark 2.0

2016-02-25 Thread Reynold Xin
politically) to have a JavaDataFrame, as a way to isolate the 1000+ extra > lines to a Java compatibility layer/class? > > > ---------- > *From:* Reynold Xin <r...@databricks.com> > *To:* "dev@spark.apache.org" <dev@spark.apache.org> >

[discuss] DataFrame vs Dataset in Spark 2.0

2016-02-25 Thread Reynold Xin
When we first introduced Dataset in 1.6 as an experimental API, we wanted to merge Dataset/DataFrame but couldn't because we didn't want to break the pre-existing DataFrame API (e.g. map function should return Dataset, rather than RDD). In Spark 2.0, one of the main API changes is to merge

Spark Summit (San Francisco, June 6-8) call for presentation due in less than week

2016-02-24 Thread Reynold Xin
Just want to send a reminder in case people don't know about it. If you are working on (or with, using) Spark, consider submitting your work to Spark Summit, coming up in June in San Francisco. https://spark-summit.org/2016/call-for-presentations/ Cheers.

Re: spark core api vs. google cloud dataflow

2016-02-23 Thread Reynold Xin
That's just the transform function in DataFrame /** * Concise syntax for chaining custom transformations. * {{{ * def featurize(ds: DataFrame) = ... * * df * .transform(featurize) * .transform(...) * }}} * @since 1.6.0 */ def transform[U](t: DataFrame
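
The doc comment quoted above describes `transform` as concise syntax for chaining user-defined transformations in a fluent style. A minimal sketch of the same pattern on a toy class (plain Python; `Frame`, `double`, and `drop_small` are invented names, not Spark API):

```python
class Frame:
    """Toy frame wrapping a list of rows, illustrating the chaining pattern."""
    def __init__(self, rows):
        self.rows = rows

    def transform(self, func):
        # Apply a user-defined Frame -> Frame function, so custom steps
        # read left-to-right like built-in method calls.
        return func(self)

def double(f):
    return Frame([r * 2 for r in f.rows])

def drop_small(f):
    return Frame([r for r in f.rows if r >= 4])

result = Frame([1, 2, 3]).transform(double).transform(drop_small)
# result.rows == [4, 6]
```

Without `transform`, the same pipeline would read inside-out as `drop_small(double(frame))`, which is what the "concise syntax" comment is getting at.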

Re: Spark 1.6.1

2016-02-22 Thread Reynold Xin
; On Tue, Feb 23, 2016 at 9:34 AM, Reynold Xin <r...@databricks.com> wrote: > >> We usually publish to a staging maven repo hosted by the ASF (not maven >> central). >> >> >> >> On Mon, Feb 22, 2016 at 11:32 PM, Romi Kuntsman <r...@totango.com>

Re: Spark 1.6.1

2016-02-22 Thread Reynold Xin
We usually publish to a staging maven repo hosted by the ASF (not maven central). On Mon, Feb 22, 2016 at 11:32 PM, Romi Kuntsman wrote: > Is it possible to make RC versions available via Maven? (many projects do > that) > That will make integration much easier, so many more

Re: DataFrame API and Ordering

2016-02-21 Thread Reynold Xin
ets Guide already has a > section about NaN semantics. This could be a good place to add at least > some basic description. > > For the rest InterpretedOrdering could be a good choice. > > On 02/19/2016 12:35 AM, Reynold Xin wrote: > > You are correct and we should document that

Re: Using Encoding to reduce GraphX's static graph memory consumption

2016-02-21 Thread Reynold Xin
+ Joey We think this is worth doing. Are you interested in submitting a pull request? On Sat, Feb 20, 2016 at 8:05 PM ahaider3 wrote: > Hi, > I have been looking through the GraphX source code, dissecting the reason > for its high memory consumption compared to the

Re: pull request template

2016-02-19 Thread Reynold Xin
there the spec for the PR title. I > always > > get wrong the order between Jira and component. > > > > Moreover, CONTRIBUTING.md is also lacking them. Any reason not to add it > > there? I can open PRs for both, but maybe you want to keep that info on > the > > wiki ins

Re: Ability to auto-detect input data for datasources (by file extension).

2016-02-18 Thread Reynold Xin
Thanks for the email. Don't make it that complicated. We just want to simplify the common cases (e.g. csv/parquet), and don't need this to work for everything out there. On Thu, Feb 18, 2016 at 9:25 PM, Hyukjin Kwon wrote: > Hi all, > > I am planning to submit a PR for >

Re: DataFrame API and Ordering

2016-02-18 Thread Reynold Xin
You are correct and we should document that. Any suggestions on where we should document this? In DoubleType and FloatType? On Tuesday, February 16, 2016, Maciej Szymkiewicz wrote: > I am not sure if I've missed something obvious but as far as I can tell > DataFrame API

Re: Kafka connector mention in Matei's keynote

2016-02-18 Thread Reynold Xin
I think Matei was referring to the Kafka direct streaming source added in 2015. On Thu, Feb 18, 2016 at 11:59 AM, Cody Koeninger wrote: > I saw this slide: >

pull request template

2016-02-17 Thread Reynold Xin
Github introduced a new feature today that allows projects to define templates for pull requests. I pushed a very simple template to the repository: https://github.com/apache/spark/blob/master/.github/PULL_REQUEST_TEMPLATE Over time I think we can see how this works and perhaps add a small

Re: Dataset in spark 2.0.0-SNAPSHOT missing columns

2016-02-15 Thread Reynold Xin
Looks like a bug. I'm also not sure whether we support Option yet. (If not, we should definitely support that in 2.0.) Can you file a JIRA ticket? On Mon, Feb 15, 2016 at 7:12 AM, Koert Kuipers wrote: > i noticed some things stopped working on datasets in spark

Spark Summit San Francisco 2016 call for presentations (CFP)

2016-02-11 Thread Reynold Xin
FYI, Call for presentations is now open for Spark Summit. The event will take place on June 6-8 in San Francisco. Submissions are welcome across a variety of Spark-related topics, including applications, development, data science, business value, spark ecosystem and research. Please submit by

Re: map-side-combine in Spark SQL

2016-02-10 Thread Reynold Xin
I'm not 100% sure I understand your question, but yes, Spark (both the RDD API and SQL/DataFrame) does partial aggregation. On Tue, Feb 9, 2016 at 8:37 PM, Rishitesh Mishra wrote: > Can anybody confirm, whether ANY operator in Spark SQL uses > map-side-combine ? If

Re: Scala API: simplifying common patterns

2016-02-08 Thread Reynold Xin
Can you create a pull request? It is difficult to know what's going on. On Mon, Feb 8, 2016 at 4:51 PM, sim wrote: > 24 test failures for sql/test: > https://gist.github.com/ssimeonov/89862967f87c5c497322 > > > > -- > View this message in context: >

Re: Scala API: simplifying common patterns

2016-02-07 Thread Reynold Xin
Both of these make sense to add. Can you submit a pull request? On Sun, Feb 7, 2016 at 3:29 PM, sim wrote: > The more Spark code I write, the more I hit the same use cases where the > Scala APIs feel a bit awkward. I'd love to understand if there are > historical reasons for

Re: Scala API: simplifying common patterns

2016-02-07 Thread Reynold Xin
Not 100% sure what's going on, but you can try wiping your local ivy2 and maven cache. On Mon, Feb 8, 2016 at 12:05 PM, sim wrote: > Reynold, I just forked + built master and I'm getting lots of binary > compatibility errors when running the tests. > >

Re: Scala API: simplifying common patterns

2016-02-07 Thread Reynold Xin
Yea I'm not sure what's going on either. You can just run the unit tests through "build/sbt sql/test" without running mima. On Mon, Feb 8, 2016 at 3:47 PM, sim wrote: > Same result with both caches cleared. > > > > -- > View this message in context: >

Re: Preserving partitioning with dataframe select

2016-02-07 Thread Reynold Xin
Matt, Thanks for the email. Are you just asking whether it should work, or reporting they don't work? Internally, the way we track physical data distribution should make the scenarios described work. If it doesn't, we should make them work. On Sat, Feb 6, 2016 at 6:49 AM, Matt Cheah

Re: Interested in Contributing to Spark as GSoC 2016

2016-02-04 Thread Reynold Xin
I will email you offline. On Thursday, February 4, 2016, Tao Lin wrote: > Hi All, > I am Tao Lin, a senior Computer Science student highly interested in Data > Science (Distributed Computing, Machine Learning, Visualization, etc.). I'd > like to join Google Summer of Code

Re: Scala 2.11 default build

2016-02-01 Thread Reynold Xin
QA%20Compile/job/SPARK-master-COMPILE-MAVEN-SCALA-2.10/ > > > https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Compile/job/SPARK-master-COMPILE-sbt-SCALA-2.10/ > > > > FYI > > > > On Mon, Feb 1, 2016 at 4:22 AM, Steve Loughran <ste...@hortonworks.com>

Scala 2.11 default build

2016-01-30 Thread Reynold Xin
FYI - I just merged Josh's pull request to switch to Scala 2.11 as the default build. https://github.com/apache/spark/pull/10608

Re: Spark SQL: Avoid shuffles when data is already partitioned on disk

2016-01-21 Thread Reynold Xin
It is not necessary if you are using bucketing available in Spark 2.0. For partitioning, it is still necessary because we do not assume each partition is small, and as a result there is no guarantee all the records for a partition end up in a single Spark task partition. On Thu, Jan 21, 2016 at
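The distinction drawn here is that bucketing fixes, at write time, which bucket every record of a given key lands in, so later joins and aggregations can line buckets up instead of shuffling. A hedged sketch of that invariant (plain Python; `write_bucketed` and the bucket count are illustrative, not Spark's on-disk layout):

```python
NUM_BUCKETS = 4

def bucket_of(key, num_buckets=NUM_BUCKETS):
    # Deterministic hash partitioning: the same key always maps
    # to the same bucket index.
    return hash(key) % num_buckets

def write_bucketed(records, num_buckets=NUM_BUCKETS):
    """Group (key, value) records into buckets, as a bucketed write would."""
    buckets = [[] for _ in range(num_buckets)]
    for key, value in records:
        buckets[bucket_of(key, num_buckets)].append((key, value))
    return buckets

left = write_bucketed([("a", 1), ("b", 2), ("a", 3)])
right = write_bucketed([("a", 10), ("b", 20)])

# Because both sides used the same bucketing function, bucket i on the
# left only ever needs to meet bucket i on the right -- no shuffle.
joined = [
    (lk, lv, rv)
    for lb, rb in zip(left, right)
    for lk, lv in lb
    for rk, rv in rb
    if lk == rk
]
```

Partitioning by a column gives no such per-bucket size or count guarantee, which is why the reply says a shuffle can still be needed in that case.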

Re: Spark SQL: Avoid shuffles when data is already partitioned on disk

2016-01-21 Thread Reynold Xin
, > StorageLevel(true, true, false, true, 1), ConvertToUnsafe, None > > > > On Fri, Jan 22, 2016 at 12:13 PM, Reynold Xin <r...@databricks.com > <javascript:_e(%7B%7D,'cvml','r...@databricks.com');>> wrote: > >> It is not necessary if you are using bu

Re: [1.6] Coalesce/binary operator on casted named column

2016-01-17 Thread Reynold Xin
To close the loop: JIRA filed: https://issues.apache.org/jira/browse/SPARK-12841 Patch created (to be merged): https://github.com/apache/spark/pull/10781 On Fri, Jan 15, 2016 at 8:54 AM, Robert Kruszewski wrote: > Hi Spark devs, > > I have been debugging failing unit

Re: Are we running SparkR tests in Jenkins?

2016-01-15 Thread Reynold Xin
+Shivaram Ah damn - we should fix it. This was broken by https://github.com/apache/spark/pull/10658 - which removed a functionality that has been deprecated since Spark 1.0. On Fri, Jan 15, 2016 at 3:19 PM, Herman van Hövell tot Westerflier < hvanhov...@questtec.nl> wrote: > Hi all, > > I

Re: [discuss] dropping Hadoop 2.2 and 2.3 support in Spark 2.0?

2016-01-14 Thread Reynold Xin
Thanks for chiming in. Note that an organization's agility in Spark upgrades can be very different from Hadoop upgrades. For many orgs, Hadoop is responsible for cluster resource scheduling (YARN) and data storage (HDFS). These two are notoriously difficult to upgrade. It is all or nothing for a

[discuss] dropping Hadoop 2.2 and 2.3 support in Spark 2.0?

2016-01-13 Thread Reynold Xin
We've dropped Hadoop 1.x support in Spark 2.0. There is also a proposal to drop Hadoop 2.2 and 2.3, i.e. the minimal Hadoop version we support would be Hadoop 2.4. The main advantage is then we'd be able to focus our Jenkins resources (and the associated maintenance of Jenkins) to create builds

Re: Tungsten in a mixed endian environment

2016-01-12 Thread Reynold Xin
How big of a deal is this use case in a heterogeneous endianness environment? If we do want to fix it, we should do it right before Spark shuffles data to minimize performance penalty, i.e. turn big-endian encoded data into little-endian encoded data before it goes on the wire. This is a
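
The conversion the reply proposes, re-encoding fixed-width values from big-endian to little-endian just before they hit the wire, can be sketched with the standard `struct` module (assuming, for illustration, a payload of 64-bit integers):

```python
import struct

def to_little_endian(big_endian_bytes):
    """Re-encode a buffer of big-endian int64s as little-endian bytes."""
    n = len(big_endian_bytes) // 8
    values = struct.unpack(f">{n}q", big_endian_bytes)  # decode big-endian
    return struct.pack(f"<{n}q", *values)               # re-encode little-endian

payload = struct.pack(">2q", 1, 258)       # what a big-endian node would emit
converted = to_little_endian(payload)      # swap once, before the shuffle wire
values = struct.unpack("<2q", converted)   # little-endian receiver decodes
# values == (1, 258)
```

Doing the swap once at the shuffle boundary, rather than on every in-memory access, is what keeps the penalty confined to heterogeneous clusters.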

Re: Dependency on TestingUtils in a Spark package

2016-01-12 Thread Reynold Xin
If you need it, just copy it over to your own package. That's probably the safest option. On Tue, Jan 12, 2016 at 12:50 PM, Ted Yu wrote: > There is no annotation in TestingUtils class indicating whether it is > suitable for consumption by external projects. > > You

Re: Automated close of PR's ?

2015-12-31 Thread Reynold Xin
l.com> > wrote: > >> I am not sure of others, but I had a PR close from under me where > >> ongoing discussion was as late as 2 weeks back. > >> Given this, I assumed it was automated close and not manual ! > >> > >> When the change was opened is not a go

Re: IndentationCheck of checkstyle

2015-12-29 Thread Reynold Xin
OK to close the loop - this thread has nothing to do with Spark? On Tue, Dec 29, 2015 at 9:55 PM, Ted Yu <yuzhih...@gmail.com> wrote: > Oops, wrong list :-) > > On Dec 29, 2015, at 9:48 PM, Reynold Xin <r...@databricks.com> wrote: > > +Herman > > Is this coming

Re: Akka with Spark

2015-12-26 Thread Reynold Xin
We are just removing Spark's dependency on Akka. It has nothing to do with whether user applications can use Akka or not. As a matter of fact, by removing the Akka dependency from Spark, it becomes easier for user applications to use Akka, because there is no more dependency conflict. For more

Re: [VOTE] Release Apache Spark 1.6.0 (RC4)

2015-12-22 Thread Reynold Xin
+1 On Tue, Dec 22, 2015 at 12:29 PM, Michael Armbrust wrote: > I'll kick the voting off with a +1. > > On Tue, Dec 22, 2015 at 12:10 PM, Michael Armbrust > wrote: > >> Please vote on releasing the following candidate as Apache Spark version >>

Re: A proposal for Spark 2.0

2015-12-22 Thread Reynold Xin
on the wiki for reference. > > Tom > > > On Tuesday, December 22, 2015 12:12 AM, Reynold Xin <r...@databricks.com> > wrote: > > > FYI I updated the master branch's Spark version to 2.0.0-SNAPSHOT. > > On Tue, Nov 10, 2015 at 3:10 PM, Reynold Xin <r...@data

Re: Tungsten gives unexpected results when selecting null elements in array

2015-12-21 Thread Reynold Xin
Thanks for the email. Do you mind creating a JIRA ticket and reply with a link to the ticket? On Mon, Dec 21, 2015 at 1:12 PM, PierreB < pierre.borckm...@realimpactanalytics.com> wrote: > I believe the problem is that the generated code does not check if the > selected item in the array is null.

Re: A proposal for Spark 2.0

2015-12-21 Thread Reynold Xin
FYI I updated the master branch's Spark version to 2.0.0-SNAPSHOT. On Tue, Nov 10, 2015 at 3:10 PM, Reynold Xin <r...@databricks.com> wrote: > I’m starting a new thread since the other one got intermixed with feature > requests. Please refrain from making feature request in

Re: A proposal for Spark 2.0

2015-12-21 Thread Reynold Xin
5:59, "Allen Zhang" <allenzhang...@126.com> wrote: > > Hi Reynold, > > Any new API support for GPU computing in our 2.0 new version ? > > -Allen > > > > > On 2015-12-22 14:12:50, "Reynold Xin" <r...@databricks.com> wrote: > > FYI I upda

Re: [VOTE] Release Apache Spark 1.6.0 (RC2)

2015-12-14 Thread Reynold Xin
+1 Tested some dataframe operations on my Mac. On Saturday, December 12, 2015, Michael Armbrust wrote: > Please vote on releasing the following candidate as Apache Spark version > 1.6.0! > > The vote is open until Tuesday, December 15, 2015 at 6:00 UTC and passes > if a

Re: JIRA: Wrong dates from imported JIRAs

2015-12-11 Thread Reynold Xin
Thanks for looking at this. Is it worth fixing? Is there a risk (although small) that the re-import would break other things? Most of those are done and I don't know how often people search JIRAs by date across projects. On Fri, Dec 11, 2015 at 3:40 PM, Lars Francke

Re: coalesce at DataFrame missing argument for shuffle.

2015-12-11 Thread Reynold Xin
I am not sure if we need it. The RDD API has way too many methods and parameters. As you said, it is simply "repartition". On Fri, Dec 11, 2015 at 2:56 PM, Hyukjin Kwon wrote: > Hi all, > > I accidentally met coalesce() function and found this taking arguments > different

Re: Does RDD[Type1, Iterable[Type2]] split into multiple partitions?

2015-12-10 Thread Reynold Xin
No, since the signature itself limits it. On Thu, Dec 10, 2015 at 9:19 PM, JaeSung Jun wrote: > Hi, > > I'm currently working on Iterable type of RDD, which is like : > > val keyValueIterableRDD[CaseClass1, Iterable[CaseClass2]] = buildRDD(...) > > If there is only one

Re: Failed to generate predicate Error when using dropna

2015-12-08 Thread Reynold Xin
Can you create a JIRA ticket for this? Thanks. On Tue, Dec 8, 2015 at 5:25 PM, Chang Ya-Hsuan wrote: > spark version: spark-1.5.2-bin-hadoop2.6 > python version: 2.7.9 > os: ubuntu 14.04 > > code to reproduce error > > # write.py > > import pyspark > sc =

Re: Returning numpy types from udfs

2015-12-05 Thread Reynold Xin
Not aware of any jira ticket, but it does sound like a great idea. On Sat, Dec 5, 2015 at 11:03 PM, Justin Uang wrote: > Hi, > > I have fallen into the trap of returning numpy types from udfs, such as > np.float64 and np.int. It's hard to find the issue because they
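
The trap described here is that numpy scalar types (`np.float64`, `np.int64`, ...) look like Python numbers but are not, and a serializer expecting builtins chokes on them. One common fix, sketched below with invented names (`to_python_scalar`, `safe_udf` are hypothetical helpers, not a pyspark API), relies on numpy scalars exposing `.item()`, which returns the equivalent builtin; a stub class stands in for `np.float64` so the sketch has no numpy dependency:

```python
def to_python_scalar(value):
    """Coerce numpy-style scalars to plain Python; pass builtins through."""
    if hasattr(value, "item") and not isinstance(value, (bool, int, float, str, bytes)):
        return value.item()
    return value

def safe_udf(func):
    # Hypothetical wrapper: coerce a udf's result before it reaches a
    # serializer that only understands builtin Python types.
    def wrapped(*args):
        return to_python_scalar(func(*args))
    return wrapped

class FakeNumpyFloat:
    """Stand-in for np.float64: numpy scalars expose .item() like this."""
    def __init__(self, v):
        self._v = v
    def item(self):
        return float(self._v)

result = safe_udf(lambda x: FakeNumpyFloat(x))(1.5)
# result is a plain builtin float
```

The "hard to find" part of the bug is exactly that the unwrapped value prints like a float, so nothing looks wrong until serialization.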

Re: IntelliJ license for committers?

2015-12-02 Thread Reynold Xin
For IntelliJ I think the free version is sufficient for Spark development. On Thursday, December 3, 2015, Sean Owen wrote: > Yeah I can see the PMC list as it happens; technically there are > committers that aren't PMC / ASF members though, yeah. Josh did update > the list

Re: Subtract implementation using broadcast

2015-11-27 Thread Reynold Xin
We need to first implement subtract and intersect in Spark SQL natively (i.e. add physical operators for them rather than using RDD.subtract/intersect). Then it should be pretty easy to do that, given it is just about injecting the right exchange operators. > On Nov 27, 2015, at 11:19

Re: A proposal for Spark 2.0

2015-11-26 Thread Reynold Xin
patibility in the move to 2.0 makes it much more >>> difficult for them to make this transition. >>> >>> Using the same set of APIs also means that it will be easier to backport >>> critical fixes to the 1.x line. >>> >>> It's not clear to me that avoiding

Re: A proposal for Spark 2.0

2015-11-25 Thread Reynold Xin
y absorb all the other ways >> that Spark breaks compatibility in the move to 2.0 makes it much more >> difficult for them to make this transition. >> >> Using the same set of APIs also means that it will be easier to backport >> critical fixes to the 1.x line. &

Re: A proposal for Spark 2.0

2015-11-23 Thread Reynold Xin
I came well >>>>> before >>>>> DataFrames and DataSets, so programming guides, introductory how-to >>>>> articles and the like have, to this point, also tended to emphasize RDDs >>>>> -- >>>>> or at least to deal with them early. Wh

Re: why does shuffle in spark write shuffle data to disk by default?

2015-11-23 Thread Reynold Xin
I think for most jobs the bottleneck isn't in writing shuffle data to disk, since shuffle data needs to be "shuffled" and sent across the network. You can always use a ramdisk yourself. Requiring ramdisk by default would significantly complicate configuration and platform portability. On Mon,

Re: Datasets on experimental dataframes?

2015-11-23 Thread Reynold Xin
The experimental tag is intended for user facing APIs. It has nothing to do with internal dependencies. On Monday, November 23, 2015, Jakob Odersky wrote: > Hi, > > datasets are being built upon the experimental DataFrame API, does this > mean DataFrames won't be

Re: Using spark MLlib without installing Spark

2015-11-21 Thread Reynold Xin
You can use MLlib and Spark directly without "installing anything". Just run Spark in local mode. On Sat, Nov 21, 2015 at 4:05 PM, Rad Gruchalski wrote: > Bowen, > > What Andy is doing in the notebook is a slightly different thing. He’s > using sbt to bring all spark jars

Re: Dropping support for earlier Hadoop versions in Spark 2.0?

2015-11-20 Thread Reynold Xin
OK I'm not exactly asking for a vote here :) I don't think we should look at it from only a maintenance point of view -- because in that case the answer is clearly supporting as few versions as possible (or just rm -rf spark source code and call it a day). It is a tradeoff between the number of

Dropping support for earlier Hadoop versions in Spark 2.0?

2015-11-19 Thread Reynold Xin
I proposed dropping support for Hadoop 1.x in the Spark 2.0 email, and I think everybody is for that. https://issues.apache.org/jira/browse/SPARK-11807 Sean suggested also dropping support for Hadoop 2.2, 2.3, and 2.4. That is to say, keep only Hadoop 2.6 and greater. What are the community's

Re: orc read issue n spark

2015-11-18 Thread Reynold Xin
What do you mean by starts delay scheduling? Are you saying it is no longer doing local reads? If that's the case you can increase the spark.locality.wait timeout. On Wednesday, November 18, 2015, Renu Yadav wrote: > Hi , > I am using spark 1.4.1 and saving orc file using >

Re: How to Add builtin geometry type to SparkSQL?

2015-11-18 Thread Reynold Xin
Have you looked into https://github.com/harsha2010/magellan ? On Wednesday, November 18, 2015, ddcd wrote: > Hi all, > > I'm considering adding geometry type to SparkSQL. > > I know that there is a project named sparkGIS > which is an

Re: Are map tasks spilling data to disk?

2015-11-15 Thread Reynold Xin
It depends on what the next operator is. If the next operator is just an aggregation, then no, the hash join won't write anything to disk. It will just stream the data through to the next operator. If the next operator is shuffle (exchange), then yes. On Sun, Nov 15, 2015 at 10:52 AM, gsvic
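The build-then-stream behavior this reply describes, the hash join materializes only the build side and pipes probe rows straight through to the next operator, can be sketched as a generator (a simplified model, not Spark's actual operator code):

```python
def hash_join(build_side, probe_side):
    """Stream the probe side through a hash table built from one side.

    Nothing is spilled to disk here: each probe row is matched and yielded
    to the downstream operator immediately.
    """
    table = {}
    for key, value in build_side:        # build phase: held in memory
        table.setdefault(key, []).append(value)
    for key, value in probe_side:        # probe phase: pure streaming
        for match in table.get(key, []):
            yield (key, match, value)

out = list(hash_join([("a", 1), ("b", 2)],
                     [("a", 10), ("c", 30), ("b", 20)]))
# out == [("a", 1, 10), ("b", 2, 20)]
```

A downstream aggregation can consume the generator row by row, which is the "no, it won't write anything to disk" case; a shuffle (exchange) downstream is what forces materialization.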

Re: Support for local disk columnar storage for DataFrames

2015-11-15 Thread Reynold Xin
treaming >> apps can take advantage of the compact columnar representation and Tungsten >> optimisations. >> >> I'm not quite sure if something like this can be achieved by other means >> or has been investigated before, hence why I'm looking for feedback here. >&g

Re: Hive on Spark Vs Spark SQL

2015-11-15 Thread Reynold Xin
It's a completely different path. On Sun, Nov 15, 2015 at 10:37 PM, kiran lonikar wrote: > I would like to know if Hive on Spark uses or shares the execution code > with Spark SQL or DataFrames? > > More specifically, does Hive on Spark benefit from the changes made to >

Re: Hive on Spark Vs Spark SQL

2015-11-15 Thread Reynold Xin
No it does not -- although it'd benefit from some of the work to make shuffle more robust. On Sun, Nov 15, 2015 at 10:45 PM, kiran lonikar <loni...@gmail.com> wrote: > So does not benefit from Project Tungsten right? > > > On Mon, Nov 16, 2015 at 12:07 PM, Reynold Xin &l

Re: SparkPullRequestBuilder coverage

2015-11-13 Thread Reynold Xin
It only runs tests that are impacted by the change. E.g. if you only modify SQL, it won't run the core or streaming tests. On Fri, Nov 13, 2015 at 11:17 AM, Ted Yu wrote: > Hi, > I noticed that SparkPullRequestBuilder completes much faster than maven > Jenkins build. > >

Re: Spark 1.4.2 release and votes conversation?

2015-11-13 Thread Reynold Xin
I actually tried to build a binary for 1.4.2 and wanted to start voting, but there was an issue with the release script that failed the jenkins job. Would be great to kick off a 1.4.2 release. On Fri, Nov 13, 2015 at 1:00 PM, Andrew Lee wrote: > Hi All, > > > I'm wondering

Re: SparkPullRequestBuilder coverage

2015-11-13 Thread Reynold Xin
y test(s) be disabled, strengthened and enabled again ? > > Cheers > > On Fri, Nov 13, 2015 at 11:20 AM, Reynold Xin <r...@databricks.com> wrote: > >> It only runs tests that are impacted by the change. E.g. if you only >> modify SQL, it won't run the core or streaming te

Re: Spark 1.4.2 release and votes conversation?

2015-11-13 Thread Reynold Xin
In the interim, you can just build it off branch-1.4 if you want. On Fri, Nov 13, 2015 at 1:30 PM, Reynold Xin <r...@databricks.com> wrote: > I actually tried to build a binary for 1.4.2 and wanted to start voting, > but there was an issue with the release script that failed the
