Re: Welcoming two new committers
Great job, guys! Congrats and welcome!

On Mon, Feb 8, 2016 at 12:05 PM, Amit Chavan <achav...@gmail.com> wrote:
> Welcome.

On Mon, Feb 8, 2016 at 2:50 PM, Suresh Thalamati <suresh.thalam...@gmail.com> wrote:
> Congratulations Herman and Wenchen!

On Mon, Feb 8, 2016 at 10:59 AM, Andrew Or <and...@databricks.com> wrote:
> Welcome!

On Mon, Feb 8, 2016 at 10:55 AM, Bhupendra Mishra <bhupendra.mis...@gmail.com> wrote:
> Congratulations to both, and welcome to the group.

On Mon, Feb 8, 2016 at 10:45 PM, Matei Zaharia <matei.zaha...@gmail.com> wrote:
> Hi all,
>
> The PMC has recently added two new Spark committers -- Herman van Hovell and Wenchen Fan. Both have been heavily involved in Spark SQL and Tungsten, adding new features, optimizations and APIs. Please join me in welcoming Herman and Wenchen.
>
> Matei

--
Ram Sriharsha
Architect, Spark and Data Science
Hortonworks
email: har...@apache.org
Re: Predicate push-down bug?
Hi Ravi,

This does look like a bug. I have created a JIRA to track it here:
https://issues.apache.org/jira/browse/SPARK-10623

Ram

On Tue, Sep 15, 2015 at 10:47 AM, Ram Sriharsha <sriharsha@gmail.com> wrote:
> Hi Ravi,
>
> Can you share more details? What Spark version are you running?
>
> Ram

On Tue, Sep 15, 2015 at 10:32 AM, Ravi Ravi <i.am.ravi.r...@gmail.com> wrote:
> Turning on predicate pushdown for ORC datasources results in a NoSuchElementException:
>
> scala> val df = sqlContext.sql("SELECT name FROM people WHERE age < 15")
> df: org.apache.spark.sql.DataFrame = [name: string]
>
> scala> sqlContext.setConf("spark.sql.orc.filterPushdown", "true")
>
> scala> df.explain
> == Physical Plan ==
> java.util.NoSuchElementException
>
> Disabling the pushdown makes things work again:
>
> scala> sqlContext.setConf("spark.sql.orc.filterPushdown", "false")
>
> scala> df.explain
> == Physical Plan ==
> Project [name#6]
>  Filter (age#7 < 15)
>   Scan OrcRelation[file:/home/mydir/spark-1.5.0-SNAPSHOT/test/people][name#6,age#7]
>
> Have any of you run into this problem before? Is a fix available?
>
> Thanks,
> Ravi
Re: [discuss] Removing individual commit messages from the squash commit message
+1

Sent from my iPhone

On Jul 18, 2015, at 2:44 PM, Patrick Wendell <pwend...@gmail.com> wrote:
> +1 from me too

On Sat, Jul 18, 2015 at 3:32 AM, Ted Yu <yuzhih...@gmail.com> wrote:
> +1 to removing commit messages.

On Jul 18, 2015, at 1:35 AM, Sean Owen <so...@cloudera.com> wrote:
> +1 to removing them. Sometimes there are 50+ commits because people have been merging from master into their branch rather than rebasing.

On Sat, Jul 18, 2015 at 8:48 AM, Reynold Xin <r...@databricks.com> wrote:
> I took a look at the commit messages in git log -- it looks like the individual commit messages are not that useful to include, but do make the commit messages more verbose. They are usually just a bunch of extremely concise descriptions of bug fixes, merges, etc.:
>
> cb3f12d [xxx] add whitespace
> 6d874a6 [xxx] support pyspark for yarn-client
> 89b01f5 [yyy] Update the unit test to add more cases
> 275d252 [yyy] Address the comments
> 7cc146d [yyy] Address the comments
> 2624723 [yyy] Fix rebase conflict
> 45befaa [yyy] Update the unit test
> bbc1c9c [yyy] Fix checkpointing doesn't retain driver port issue
>
> Anybody against removing those from the merge script so the log looks cleaner? If nobody feels strongly about this, we can just create a JIRA to remove them, and only keep the author names.
Re: TableScan vs PrunedScan
Hi Gil,

You would need to prune the resulting Row as well, based on the requested columns.

Ram

Sent from my iPhone

On Jul 7, 2015, at 3:12 AM, Gil Vernik <g...@il.ibm.com> wrote:
> Hi All,
>
> I wanted to experiment a little bit with TableScan and PrunedScan. My first test was to print columns from various SQL queries. To make this test easier, I just took spark-csv and replaced TableScan with PrunedScan. I then changed the buildScan method of CsvRelation from
>
>   def buildScan = {
>
> to
>
>   def buildScan(requiredColumns: Array[String]) = { ...
>
> This was the only modification I made to CsvRelation.scala, apart from adding a log statement that prints requiredColumns. I then took the same CSV file and ran a very simple SELECT query on it. I noticed that when CsvRelation used TableScan, everything worked correctly. But when I used PrunedScan, it didn't work and returned empty columns or columns in the wrong order.
>
> Why does this happen? Is it a bug? I thought that PrunedScan was supposed to work exactly the same as TableScan and that I could freely change TableScan to PrunedScan, the only difference being that buildScan of PrunedScan takes requiredColumns as a parameter.
>
> Can someone explain the behavior I saw? I am using Spark 1.5 from trunk.
>
> Thanks a lot,
> Gil
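For anyone running into the same thing: the PrunedScan contract is that buildScan(requiredColumns) must return Rows containing only the requested columns, in the requested order, rather than the full parsed row. Below is a minimal sketch against the Spark 1.x data sources API; the class name, in-memory input and naive comma splitting are illustrative assumptions, not the actual spark-csv internals.

  import org.apache.spark.rdd.RDD
  import org.apache.spark.sql.{Row, SQLContext}
  import org.apache.spark.sql.sources.{BaseRelation, PrunedScan}
  import org.apache.spark.sql.types.{StringType, StructField, StructType}

  // Hypothetical CSV-like relation over in-memory lines (illustrative only).
  class SimpleCsvRelation(lines: Seq[String], columnNames: Seq[String])
                         (@transient val sqlContext: SQLContext)
    extends BaseRelation with PrunedScan {

    override def schema: StructType =
      StructType(columnNames.map(name => StructField(name, StringType, nullable = true)))

    // PrunedScan contract: emit only the requested columns, in the requested order.
    override def buildScan(requiredColumns: Array[String]): RDD[Row] = {
      val indices = requiredColumns.map(c => columnNames.indexOf(c))
      sqlContext.sparkContext.parallelize(lines).map { line =>
        val fields = line.split(",", -1)
        Row.fromSeq(indices.map(i => fields(i)).toSeq)
      }
    }
  }

Returning the full Row from a PrunedScan is what produces the empty or reordered columns described above, because Spark assumes the returned Rows already match requiredColumns.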
Re: Remove Hadoop 1 support (Hadoop 2.2) for Spark 1.5?
+1 for Hadoop 2.2+

On Fri, Jun 12, 2015 at 8:45 AM, Nicholas Chammas <nicholas.cham...@gmail.com> wrote:
> I'm personally in favor, but I don't have a sense of how many people still rely on Hadoop 1.
>
> Nick

On Fri, Jun 12, 2015 at 9:13 AM, Steve Loughran <ste...@hortonworks.com> wrote:
> +1 for 2.2+
>
> Not only are the APIs in Hadoop 2 better, there are more people testing Hadoop 2.x Spark, and bugs in Hadoop itself are being fixed. (Usual disclaimers: I work off branch-2.7 snapshots I build nightly, etc.)

On 12 Jun 2015, at 11:09, Sean Owen <so...@cloudera.com> wrote:
> How does the idea of removing support for Hadoop 1.x for Spark 1.5 strike everyone? Really, I mean Hadoop < 2.2, as 2.2 seems to me more consistent with the modern 2.x line than 2.1 or 2.0.
>
> The arguments against are simply, well, someone out there might be using these versions. The arguments for are just simplification -- fewer gotchas in trying to keep supporting older Hadoop, of which we've seen several lately. We get to chop out a little bit of shim code and update to use some non-deprecated APIs. Along with removing support for Java 6, it might be a reasonable time to also draw a line under older Hadoop.
>
> I'm just gauging feeling now: for, against, indifferent? I favor it, but would not push hard on it if there are objections.
Re: [VOTE] Release Apache Spark 1.4.0 (RC4)
+1. Tested with Hadoop 2.6 / YARN on CentOS 6.5 after building with -Pyarn -Phadoop-2.4 -Dhadoop.version=2.6.0 -Phive -Phive-thriftserver, and ran a few SQL tests and the ML examples.

On Fri, Jun 5, 2015 at 10:55 AM, Hari Shreedharan <hshreedha...@cloudera.com> wrote:
> +1. Build looks good, ran a couple of apps on YARN.
>
> Thanks,
> Hari

On Fri, Jun 5, 2015 at 10:52 AM, Yin Huai <yh...@databricks.com> wrote:
> Sean,
>
> Can you add -Phive -Phive-thriftserver and try those Hive tests?
>
> Thanks,
> Yin

On Fri, Jun 5, 2015 at 5:19 AM, Sean Owen <so...@cloudera.com> wrote:
> Everything checks out again, and the tests pass for me on Ubuntu + Java 7 with '-Pyarn -Phadoop-2.6', except that I always get SparkSubmitSuite errors like:
>
> ...
> - success sanity check *** FAILED ***
>   java.lang.RuntimeException: [download failed: org.jboss.netty#netty;3.2.2.Final!netty.jar(bundle), download failed: commons-net#commons-net;3.1!commons-net.jar]
>   at org.apache.spark.deploy.SparkSubmitUtils$.resolveMavenCoordinates(SparkSubmit.scala:978)
>   at org.apache.spark.sql.hive.client.IsolatedClientLoader$$anonfun$3.apply(IsolatedClientLoader.scala:62)
>   ...
>
> I also can't get the Hive tests to pass. Is anyone else seeing anything like this? If not, I'll assume this is something specific to my environment -- or that I don't have the build invocation just right. It's puzzling since it's so consistent, but I presume others' tests pass and Jenkins does.

On Wed, Jun 3, 2015 at 5:53 AM, Patrick Wendell <pwend...@gmail.com> wrote:
> Please vote on releasing the following candidate as Apache Spark version 1.4.0!
>
> The tag to be voted on is v1.4.0-rc3 (commit 22596c5):
> https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=22596c534a38cfdda91aef18aa9037ab101e4251
>
> The release files, including signatures, digests, etc. can be found at:
> http://people.apache.org/~pwendell/spark-releases/spark-1.4.0-rc4-bin/
>
> Release artifacts are signed with the following key:
> https://people.apache.org/keys/committer/pwendell.asc
>
> The staging repository for this release can be found at:
> [published as version: 1.4.0]
> https://repository.apache.org/content/repositories/orgapachespark-/
> [published as version: 1.4.0-rc4]
> https://repository.apache.org/content/repositories/orgapachespark-1112/
>
> The documentation corresponding to this release can be found at:
> http://people.apache.org/~pwendell/spark-releases/spark-1.4.0-rc4-docs/
>
> Please vote on releasing this package as Apache Spark 1.4.0! The vote is open until Saturday, June 06, at 05:00 UTC and passes if a majority of at least 3 +1 PMC votes are cast.
>
> [ ] +1 Release this package as Apache Spark 1.4.0
> [ ] -1 Do not release this package because ...
>
> To learn more about Apache Spark, please see http://spark.apache.org/
>
> == What has changed since RC3 ==
> In addition to many smaller fixes, three blocker issues were fixed:
> 4940630 [SPARK-8020] [SQL] Spark SQL conf in spark-defaults.conf make metadataHive get constructed too early
> 6b0f615 [SPARK-8038] [SQL] [PYSPARK] fix Column.when() and otherwise()
> 78a6723 [SPARK-7978] [SQL] [PYSPARK] DecimalType should not be singleton
>
> == How can I help test this release? ==
> If you are a Spark user, you can help us test this release by taking a Spark 1.3 workload, running it on this release candidate, and reporting any regressions.
>
> == What justifies a -1 vote for this release? ==
> This vote is happening towards the end of the 1.4 QA period, so -1 votes should only occur for significant regressions from 1.3.1. Bugs already present in 1.3.x, minor regressions, or bugs related to new features will not block this release.
Re: Contribute code to MLlib
Hi Trevor,

Good point, I didn't mean that an algorithm has to be clearly better than another in every scenario to be included in MLlib. However, even if someone is willing to be the maintainer of a piece of code, it does not make sense to accept every possible algorithm into the core library. That said, the specific algorithms should be discussed in the JIRA: as you point out, there is no clear way to decide which algorithm to include and which not to, and usually mature algorithms that serve a wide variety of scenarios are easier to argue about, but nothing prevents anyone from opening a ticket to discuss any specific machine learning algorithm.

My suggestion was simply that, for the purpose of making experimental or newer algorithms available to Spark users, they don't necessarily have to be in the core library. Spark packages are good enough in this respect. Isn't it better for newer algorithms to take this route and prove themselves before we bring them into the core library? Especially given that the barrier to using Spark packages is very low.

Ram

On Wed, May 20, 2015 at 9:05 AM, Trevor Grant <trevor.d.gr...@gmail.com> wrote:
> Hey Ram,
>
> I'm not speaking to Tarek's package specifically but to the spirit of MLlib. There are a number of methods/algorithms for PCA, and I'm not sure by what criterion the current one is considered 'standard'. It is rare to find ANY machine learning algorithm that is 'clearly better' than any other. They are all tools; they have their place and time.
>
> I agree that it makes sense to field new algorithms as packages and then integrate them into MLlib once they are 'proven' (in terms of stability/performance/anyone cares). That being said, if MLlib takes the stance that 'what we have is good enough unless something is *clearly* better', then it will never grow into a suite with the depth and richness of sklearn. From a practitioner's standpoint, it's nice to have everything I could ever want ready in an 'off-the-shelf' form.
>
> 'Better than the existing implementation for a large number of use cases' shouldn't be the criterion when selecting what to include in MLlib. The important question should be: 'Are you willing to take on responsibility for maintaining this, because you may be the only person on earth who understands the mechanics AND how to code it?'. Obviously we don't want any random junk algorithm included. But trying to say 'this way of doing PCA is better than that way in a large class of cases' is like trying to say 'geometry is more important than calculus in a large class of cases' -- maybe it's true, but geometry won't help you if you are in a case where you need calculus.
>
> This all relies on the assumption that MLlib is destined to be a rich data science/machine learning package. It may be that the goal is to make the project as lightweight and parsimonious as possible; if so, excuse me for speaking out of turn.

On Tue, May 19, 2015 at 10:41 AM, Ram Sriharsha <sriharsha@gmail.com> wrote:
> Hi Trevor, Tarek,
>
> You can make non-standard algorithms (PCA or otherwise) available to users of Spark as Spark packages:
> http://spark-packages.org
> https://databricks.com/blog/2014/12/22/announcing-spark-packages.html
>
> With the availability of Spark packages, adding powerful experimental / alternative machine learning algorithms to the pipeline has never been easier. I would suggest that route in scenarios where one machine learning algorithm is not clearly better in the common scenarios than an existing implementation in MLlib.
>
> If your algorithm is better than the existing PCA implementation for a large class of use cases, then we should open a JIRA and discuss the relative strengths and weaknesses (perhaps with some benchmarks) so we can better understand whether it makes sense to switch out the existing PCA implementation and make yours the default.
>
> Ram

On Tue, May 19, 2015 at 6:56 AM, Trevor Grant <trevor.d.gr...@gmail.com> wrote:
> There are most likely advantages and disadvantages to Tarek's algorithm relative to the current implementation, and different scenarios where each is more appropriate. Would we not offer multiple PCA algorithms and let the user choose?
>
> Trevor
>
> Trevor Grant
> Data Scientist
> *Fortunate is he, who is able to know the causes of things. -Virgil*

On Mon, May 18, 2015 at 4:18 PM, Joseph Bradley <jos...@databricks.com> wrote:
> Hi Tarek,
>
> Thanks for your interest, and for checking the guidelines first! On 2 points:
>
> Algorithm: PCA is of course a critical algorithm. The main question is how your algorithm/implementation differs from the current PCA. If it's different and potentially better, I'd recommend opening up a JIRA for explaining and discussing it.
>
> Java/Scala: We really do require that algorithms be in Scala, for the sake of maintainability. The conversion should be doable if you're willing, since Scala is a pretty friendly language. If you create the JIRA, you could also ask for help there to see if someone can collaborate with you to convert
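As an aside on the "barrier is very low" point above: a published Spark package can usually be pulled into a session with the --packages flag of spark-shell or spark-submit, for example (the coordinates below are only an illustration, using the spark-csv package):

  spark-shell --packages com.databricks:spark-csv_2.10:1.2.0

Spark resolves the package and its dependencies from the Spark Packages / Maven repositories at launch time, so trying out an experimental algorithm does not require rebuilding Spark or MLlib.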
Re: Contribute code to MLlib
Hi Trevor,

I'm attaching the MLlib contribution guidelines here:
https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark#ContributingtoSpark-MLlib-specificContributionGuidelines

They speak to widely known and accepted algorithms, but not to whether an algorithm has to be better than another in every scenario, etc. I think the guidelines explain what a good contribution to the core library should look like better than I initially attempted to!

Sent from my iPhone

On May 20, 2015, at 9:31 AM, Ram Sriharsha <sriharsha@gmail.com> wrote:
> Good point, I didn't mean that an algorithm has to be clearly better than another in every scenario to be included in MLlib. [...]
Re: [discuss] ending support for Java 6?
+1 for ending support for Java 6.

On Thursday, April 30, 2015 3:08 PM, Vinod Kumar Vavilapalli <vino...@hortonworks.com> wrote:
> FYI, after enough consideration, we the Hadoop community dropped support for JDK 6 starting with release Apache Hadoop 2.7.x.
>
> Thanks,
> +Vinod

On Apr 30, 2015, at 12:02 PM, Reynold Xin <r...@databricks.com> wrote:
> This has been discussed a few times in the past, but now that Oracle has ended support for Java 6 for over a year, I wonder if we should just drop Java 6 support.
>
> There is one outstanding issue Tom has brought to my attention: PySpark on YARN doesn't work well with Java 7/8, but we have an outstanding pull request to fix that.
>
> https://issues.apache.org/jira/browse/SPARK-6869
> https://issues.apache.org/jira/browse/SPARK-1920