Re: Welcoming two new committers

2016-02-08 Thread Ram Sriharsha
great job guys! congrats and welcome!

On Mon, Feb 8, 2016 at 12:05 PM, Amit Chavan <achav...@gmail.com> wrote:

> Welcome.
>
> On Mon, Feb 8, 2016 at 2:50 PM, Suresh Thalamati <
> suresh.thalam...@gmail.com> wrote:
>
>> Congratulations Herman and Wenchen!
>>
>> On Mon, Feb 8, 2016 at 10:59 AM, Andrew Or <and...@databricks.com> wrote:
>>
>>> Welcome!
>>>
>>> 2016-02-08 10:55 GMT-08:00 Bhupendra Mishra <bhupendra.mis...@gmail.com>:
>>>
>>>> Congratulations to both, and welcome to the group.
>>>>
>>>> On Mon, Feb 8, 2016 at 10:45 PM, Matei Zaharia <matei.zaha...@gmail.com> wrote:
>>>>
>>>>> Hi all,
>>>>>
>>>>> The PMC has recently added two new Spark committers -- Herman van
>>>>> Hovell and Wenchen Fan. Both have been heavily involved in Spark SQL and
>>>>> Tungsten, adding new features, optimizations and APIs. Please join me in
>>>>> welcoming Herman and Wenchen.
>>>>>
>>>>> Matei
>>>>> -
>>>>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>>>>> For additional commands, e-mail: dev-h...@spark.apache.org
>>>>>
>>>>>
>>>>
>>>
>>
>


-- 
Ram Sriharsha
Architect, Spark and Data Science
Hortonworks, 2550 Great America Way, 2nd Floor
Santa Clara, CA 95054
Ph: 408-510-8635
email: har...@apache.org

LinkedIn: https://www.linkedin.com/in/harsha340
Twitter: https://twitter.com/halfabrane
GitHub: https://github.com/harsha2010/


Re: Predicate push-down bug?

2015-09-15 Thread Ram Sriharsha
Hi Ravi

This does look like a bug. I have created a JIRA to track it here:

https://issues.apache.org/jira/browse/SPARK-10623

Ram

On Tue, Sep 15, 2015 at 10:47 AM, Ram Sriharsha <sriharsha@gmail.com>
wrote:

> Hi Ravi
>
> Can you share more details? What Spark version are you running?
>
> Ram
>
> On Tue, Sep 15, 2015 at 10:32 AM, Ravi Ravi <i.am.ravi.r...@gmail.com>
> wrote:
>
>> Turning on predicate pushdown for ORC datasources results in a
>> NoSuchElementException:
>>
>> scala> val df = sqlContext.sql("SELECT name FROM people WHERE age < 15")
>> df: org.apache.spark.sql.DataFrame = [name: string]
>>
>> scala> sqlContext.setConf("spark.sql.orc.filterPushdown", "true")
>>
>> scala> df.explain
>> == Physical Plan ==
>> java.util.NoSuchElementException
>>
>> Disabling the pushdown makes things work again:
>>
>> scala> sqlContext.setConf("spark.sql.orc.filterPushdown", "false")
>>
>> scala> df.explain
>> == Physical Plan ==
>> Project [name#6]
>>  Filter (age#7 < 15)
>>   Scan
>> OrcRelation[file:/home/mydir/spark-1.5.0-SNAPSHOT/test/people][name#6,age#7]
>>
>> Have any of you run into this problem before? Is a fix available?
>>
>> Thanks,
>> Ravi
>>
>>
>
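
A hypothetical setup for reproducing the report above; the path, the sample
data, and the use of a HiveContext are assumptions, since the thread does not
show how the "people" table was created:

  // Spark 1.5 shell sketch; ORC support ships with the Hive module, so a
  // HiveContext is used here. The path /tmp/people_orc is made up.
  import org.apache.spark.sql.hive.HiveContext

  val sqlContext = new HiveContext(sc)
  import sqlContext.implicits._

  case class Person(name: String, age: Int)
  sc.parallelize(Seq(Person("alice", 10), Person("bob", 20))).toDF()
    .write.format("orc").save("/tmp/people_orc")
  sqlContext.read.format("orc").load("/tmp/people_orc").registerTempTable("people")

  sqlContext.setConf("spark.sql.orc.filterPushdown", "true")
  sqlContext.sql("SELECT name FROM people WHERE age < 15").explain()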


Re: [discuss] Removing individual commit messages from the squash commit message

2015-07-18 Thread Ram Sriharsha
+1 

Sent from my iPhone

 On Jul 18, 2015, at 2:44 PM, Patrick Wendell pwend...@gmail.com wrote:
 
 +1 from me too
 
 On Sat, Jul 18, 2015 at 3:32 AM, Ted Yu yuzhih...@gmail.com wrote:
 +1 to removing commit messages.
 
 
 
 On Jul 18, 2015, at 1:35 AM, Sean Owen so...@cloudera.com wrote:
 
 +1 to removing them. Sometimes there are 50+ commits because people
 have been merging from master into their branch rather than rebasing.
 
 On Sat, Jul 18, 2015 at 8:48 AM, Reynold Xin r...@databricks.com wrote:
 I took a look at the commit messages in git log -- it looks like the
 individual commit messages are not that useful to include, but do make the
 commit messages more verbose. They are usually just a bunch of extremely
 concise descriptions of bug fixes, merges, etc:
 
   cb3f12d [xxx] add whitespace
   6d874a6 [xxx] support pyspark for yarn-client
 
   89b01f5 [yyy] Update the unit test to add more cases
   275d252 [yyy] Address the comments
   7cc146d [yyy] Address the comments
   2624723 [yyy] Fix rebase conflict
   45befaa [yyy] Update the unit test
   bbc1c9c [yyy] Fix checkpointing doesn't retain driver port issue
 
 
 Anybody against removing those from the merge script so the log looks
 cleaner? If nobody feels strongly about this, we can just create a JIRA to
 remove them, and only keep the author names.
 
 

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: TableScan vs PrunedScan

2015-07-07 Thread Ram Sriharsha
Hi Gil

You would need to prune the resulting Row as well based on the requested 
columns.

Ram

Sent from my iPhone

 On Jul 7, 2015, at 3:12 AM, Gil Vernik g...@il.ibm.com wrote:
 
 Hi All, 
 
 I wanted to experiment a little bit with TableScan and PrunedScan.
 My first test was to print the columns requested by various SQL queries.
 To make this test easier, I just took spark-csv and replaced TableScan with
 PrunedScan.
 I then changed the buildScan method of CsvRelation from

 def buildScan = {

 to

 def buildScan(requiredColumns: Array[String]) = {…

 This was the only modification I made to CsvRelation.scala, plus a log
 statement that prints requiredColumns.

 I then took the same CSV file and ran a very simple SELECT query on it.
 I noticed that when CsvRelation used TableScan, everything worked correctly.
 But when I used PrunedScan, it didn't work: it returned empty columns, or
 columns in the wrong order.

 Why does this happen? Is it a bug? I thought that PrunedScan was
 supposed to work exactly the same as TableScan and that I could freely change
 TableScan to PrunedScan. I thought the only difference is that buildScan
 of PrunedScan takes requiredColumns as a parameter.

 Can someone explain the behavior I saw?
 
 I am using Spark 1.5 from trunk. 
 Thanks a lot 
 Gil.
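
A minimal sketch of the point Ram makes above: with PrunedScan, buildScan itself
must project every Row down to exactly the requiredColumns, in the requested
order -- Spark does not re-prune the rows for the relation. The relation below is
hypothetical (not spark-csv), just to illustrate the shape of the fix:

  import org.apache.spark.rdd.RDD
  import org.apache.spark.sql.{Row, SQLContext}
  import org.apache.spark.sql.sources.{BaseRelation, PrunedScan}
  import org.apache.spark.sql.types.StructType

  // Toy relation over an in-memory Seq of full-width rows.
  class ExampleRelation(override val sqlContext: SQLContext,
                        override val schema: StructType,
                        data: Seq[Row]) extends BaseRelation with PrunedScan {

    override def buildScan(requiredColumns: Array[String]): RDD[Row] = {
      // Position of each requested column in the full schema.
      val indices = requiredColumns.map(schema.fieldNames.indexOf(_))
      sqlContext.sparkContext.parallelize(data).map { fullRow =>
        // Emit a pruned Row that matches requiredColumns exactly, in order.
        Row.fromSeq(indices.map(i => fullRow.get(i)))
      }
    }
  }

Returning the full-width rows unchanged from buildScan(requiredColumns) is what
produces the empty or misordered columns described above.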


Re: Remove Hadoop 1 support (Hadoop 2.2) for Spark 1.5?

2015-06-12 Thread Ram Sriharsha
+1 for Hadoop 2.2+

On Fri, Jun 12, 2015 at 8:45 AM, Nicholas Chammas 
nicholas.cham...@gmail.com wrote:

 I'm personally in favor, but I don't have a sense of how many people still
 rely on Hadoop 1.

 Nick

 On Fri, Jun 12, 2015 at 9:13 AM, Steve Loughran
 ste...@hortonworks.com wrote:

 +1 for 2.2+

 Not only are the APIs in Hadoop 2 better, there are more people testing
 Hadoop 2.x + Spark, and bugs in Hadoop itself being fixed.

 (usual disclaimers, I work off branch-2.7 snapshots I build nightly, etc)

  On 12 Jun 2015, at 11:09, Sean Owen so...@cloudera.com wrote:
 
  How does the idea of removing support for Hadoop 1.x for Spark 1.5
  strike everyone? Really, I mean, Hadoop < 2.2, as 2.2 seems to me more
  consistent with the modern 2.x line than 2.1 or 2.0.
 
  The arguments against are simply, well, someone out there might be
  using these versions.
 
  The arguments for are just simplification -- fewer gotchas in trying
  to keep supporting older Hadoop, of which we've seen several lately.
  We get to chop out a little bit of shim code and update to use some
  non-deprecated APIs. Along with removing support for Java 6, it might
  be a reasonable time to also draw a line under older Hadoop too.
 
  I'm just gauging feeling now: for, against, indifferent?
  I favor it, but would not push hard on it if there are objections.
 
  -
  To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
  For additional commands, e-mail: dev-h...@spark.apache.org
 


 -
 To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
 For additional commands, e-mail: dev-h...@spark.apache.org




Re: [VOTE] Release Apache Spark 1.4.0 (RC4)

2015-06-05 Thread Ram Sriharsha
+1, tested with Hadoop 2.6 / YARN on CentOS 6.5 after building with -Pyarn
-Phadoop-2.4 -Dhadoop.version=2.6.0 -Phive -Phive-thriftserver and ran a
few SQL tests and the ML examples

On Fri, Jun 5, 2015 at 10:55 AM, Hari Shreedharan hshreedha...@cloudera.com
 wrote:

 +1. Build looks good, ran a couple apps on YARN


 Thanks,
 Hari

 On Fri, Jun 5, 2015 at 10:52 AM, Yin Huai yh...@databricks.com wrote:

 Sean,

 Can you add -Phive -Phive-thriftserver and try those Hive tests?

 Thanks,

 Yin

 On Fri, Jun 5, 2015 at 5:19 AM, Sean Owen so...@cloudera.com wrote:

 Everything checks out again, and the tests pass for me on Ubuntu +
 Java 7 with '-Pyarn -Phadoop-2.6', except that I always get
 SparkSubmitSuite errors like ...

 - success sanity check *** FAILED ***
   java.lang.RuntimeException: [download failed:
 org.jboss.netty#netty;3.2.2.Final!netty.jar(bundle), download failed:
 commons-net#commons-net;3.1!commons-net.jar]
   at
 org.apache.spark.deploy.SparkSubmitUtils$.resolveMavenCoordinates(SparkSubmit.scala:978)
   at
 org.apache.spark.sql.hive.client.IsolatedClientLoader$$anonfun$3.apply(IsolatedClientLoader.scala:62)
   ...

  I also can't get the Hive tests to pass. Is anyone else seeing anything
  like this? If not, I'll assume this is something specific to the env --
 or that I don't have the build invocation just right. It's puzzling
 since it's so consistent, but I presume others' tests pass and Jenkins
 does.


 On Wed, Jun 3, 2015 at 5:53 AM, Patrick Wendell pwend...@gmail.com
 wrote:
  Please vote on releasing the following candidate as Apache Spark
 version 1.4.0!
 
  The tag to be voted on is v1.4.0-rc3 (commit 22596c5):
   https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=22596c534a38cfdda91aef18aa9037ab101e4251
 
  The release files, including signatures, digests, etc. can be found at:
  http://people.apache.org/~pwendell/spark-releases/spark-1.4.0-rc4-bin/
 
  Release artifacts are signed with the following key:
  https://people.apache.org/keys/committer/pwendell.asc
 
  The staging repository for this release can be found at:
  [published as version: 1.4.0]
 
 https://repository.apache.org/content/repositories/orgapachespark-/
  [published as version: 1.4.0-rc4]
 
 https://repository.apache.org/content/repositories/orgapachespark-1112/
 
  The documentation corresponding to this release can be found at:
 
 http://people.apache.org/~pwendell/spark-releases/spark-1.4.0-rc4-docs/
 
  Please vote on releasing this package as Apache Spark 1.4.0!
 
  The vote is open until Saturday, June 06, at 05:00 UTC and passes
  if a majority of at least 3 +1 PMC votes are cast.
 
  [ ] +1 Release this package as Apache Spark 1.4.0
  [ ] -1 Do not release this package because ...
 
  To learn more about Apache Spark, please see
  http://spark.apache.org/
 
  == What has changed since RC3 ==
  In addition to many smaller fixes, three blocker issues were fixed:
  4940630 [SPARK-8020] [SQL] Spark SQL conf in spark-defaults.conf make
  metadataHive get constructed too early
  6b0f615 [SPARK-8038] [SQL] [PYSPARK] fix Column.when() and otherwise()
  78a6723 [SPARK-7978] [SQL] [PYSPARK] DecimalType should not be
 singleton
 
  == How can I help test this release? ==
  If you are a Spark user, you can help us test this release by
  taking a Spark 1.3 workload and running it on this release candidate,
  then reporting any regressions.
 
  == What justifies a -1 vote for this release? ==
  This vote is happening towards the end of the 1.4 QA period,
  so -1 votes should only occur for significant regressions from 1.3.1.
  Bugs already present in 1.3.X, minor regressions, or bugs related
  to new features will not block this release.
 
  -
  To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
  For additional commands, e-mail: dev-h...@spark.apache.org
 

 -
 To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
 For additional commands, e-mail: dev-h...@spark.apache.org






Re: Contribute code to MLlib

2015-05-20 Thread Ram Sriharsha
Hi Trevor

Good point, I didn't mean that some algorithm has to be clearly better than
another in every scenario to be included in MLLib. However, even if someone
is willing to be the maintainer of a piece of code, it does not make sense
to accept every possible algorithm into the core library.

That said, the specific algorithms should be discussed in the JIRA: as you
point out, there is no clear way to decide what algorithm to include and
what not to, and usually mature algorithms that serve a wide variety of
scenarios are easier to argue about but nothing prevents anyone from
opening a ticket to discuss any specific machine learning algorithm.

My suggestion was simply that, for the purpose of making experimental or newer
algorithms available to Spark users, they don't necessarily have to be in
the core library. Spark packages are good enough in this respect.

Isn't it better for newer algorithms to take this route and prove
themselves before we bring them into the core library? Especially given that the
barrier to using Spark packages is very low.

Ram



On Wed, May 20, 2015 at 9:05 AM, Trevor Grant trevor.d.gr...@gmail.com
wrote:

 Hey Ram,

 I'm not speaking to Tarek's package specifically but to the spirit of
 MLlib. There are a number of methods/algorithms for PCA, and I'm not sure by
 what criterion the current one is considered 'standard'.

 It is rare to find ANY machine learning algo that is 'clearly better' than
 any other. They are all tools; they have their place and time. I agree
 that it makes sense to field new algorithms as packages and then integrate
 them into MLlib once they are 'proven' (in terms of stability/performance/anyone
 cares). That being said, if MLlib takes the stance that 'what we have is
 good enough unless something is clearly better', then it will never
 grow into a suite with the depth and richness of sklearn. From a
 practitioner's standpoint, it's nice to have everything I could ever want
 ready in an 'off-the-shelf' form.

 'A large number of use cases better than existing' shouldn't be a criterion
 when selecting what to include in MLlib. The important question should be,
 'Are you willing to take on responsibility for maintaining this, because you
 may be the only person on earth who understands the mechanics AND how to
 code it?' Obviously we don't want any random junk algo included. But
 trying to say 'this way of doing PCA is better than that way in a large
 class of cases' is like trying to say 'geometry is more important than
 calculus in a large class of cases': maybe it's true, but geometry won't help
 you if you are in a case where you need calculus.

 This all relies on the assumption that MLlib is destined to be a rich data
 science/machine learning package. It may be that the goal is to make the
 project as lightweight and parsimonious as possible; if so, excuse me for
 speaking out of turn.


 On Tue, May 19, 2015 at 10:41 AM, Ram Sriharsha sriharsha@gmail.com
 wrote:

 Hi Trevor, Tarek

 You can make non-standard algorithms (PCA or otherwise) available to users of
 Spark as Spark Packages.
 http://spark-packages.org
 https://databricks.com/blog/2014/12/22/announcing-spark-packages.html

 With the availability of spark packages, adding powerful experimental /
 alternative machine learning algorithms to the pipeline has never been
 easier. I would suggest that route in scenarios where one machine learning
 algorithm is not clearly better in the common scenarios than an existing
 implementation in MLLib.

 If your algorithm is better than the existing PCA implementation for a large
 class of use cases, then we should open a JIRA and discuss the
 relative strengths/weaknesses (perhaps with some benchmarks) so we can
 better understand whether it makes sense to switch out the existing PCA
 implementation and make yours the default.

 Ram

 On Tue, May 19, 2015 at 6:56 AM, Trevor Grant trevor.d.gr...@gmail.com
 wrote:

  There are most likely advantages and disadvantages to Tarek's algorithm
 relative to the current implementation, and different scenarios where each is
 more appropriate.

 Would we not offer multiple PCA algorithms and let the user choose?

 Trevor

 Trevor Grant
 Data Scientist

 Fortunate is he, who is able to know the causes of things. -Virgil


 On Mon, May 18, 2015 at 4:18 PM, Joseph Bradley jos...@databricks.com
 wrote:

 Hi Tarek,

 Thanks for your interest & for checking the guidelines first!  On 2
 points:

 Algorithm: PCA is of course a critical algorithm.  The main question is
 how your algorithm/implementation differs from the current PCA.  If it's
 different and potentially better, I'd recommend opening up a JIRA for
 explaining & discussing it.

 Java/Scala: We really do require that algorithms be in Scala, for the
 sake of maintainability.  The conversion should be doable if you're willing
 since Scala is a pretty friendly language.  If you create the JIRA, you
 could also ask for help there to see if someone can collaborate with you to
 convert

Re: Contribute code to MLlib

2015-05-20 Thread Ram Sriharsha
Hi Trevor

I'm attaching the MLLib contribution guideline here:
https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark#ContributingtoSpark-MLlib-specificContributionGuidelines

It speaks to widely known and accepted algorithms, but not to whether an
algorithm has to be better than another in every scenario, etc.

I think the guideline explains what a good contribution to the core library
should look like better than I initially attempted to!

Sent from my iPhone


Re: [discuss] ending support for Java 6?

2015-04-30 Thread Ram Sriharsha
+1 for end of support for Java 6 


 On Thursday, April 30, 2015 3:08 PM, Vinod Kumar Vavilapalli 
vino...@hortonworks.com wrote:
   

 FYI, after enough consideration, we the Hadoop community dropped support for 
JDK 6 starting with release Apache Hadoop 2.7.x.

Thanks
+Vinod

On Apr 30, 2015, at 12:02 PM, Reynold Xin r...@databricks.com wrote:

 This has been discussed a few times in the past, but now that Oracle has ended
 support for Java 6 for over a year, I wonder if we should just drop Java 6
 support.
 
 There is one outstanding issue Tom has brought to my attention: PySpark on
 YARN doesn't work well with Java 7/8, but we have an outstanding pull
 request to fix that.
 
 https://issues.apache.org/jira/browse/SPARK-6869
 https://issues.apache.org/jira/browse/SPARK-1920


-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org