Re: I want to contribute two quality measures (ARHR and HR) for top-N recommendation to MLlib. Is this meaningful?

2014-08-27 Thread Lizhengbing (bing, BIPA)
In fact, prec@k is similar to HR and ndcg@k is similar to ARHR.
After my study, I have not found a single best measure for evaluating recommendation systems.

Xiangrui, do you think it is reasonable to create a class that provides popular 
measures for evaluating recommendation systems?

Popular measures for recommendation systems include precision, coverage, 
diversity, and others.
Most of these measures can be found in the book Recommender Systems Handbook.




From: Xiangrui Meng [mailto:men...@gmail.com]
Sent: August 26, 2014, 3:28
To: Lizhengbing (bing, BIPA)
Cc: dev@spark.apache.org
Subject: Re: I want to contribute two quality measures (ARHR and HR) for top-N recommendation to MLlib. Is this meaningful?

The evaluation metrics are definitely useful. How do they differ from 
traditional IR metrics like prec@k and ndcg@k? -Xiangrui

On Mon, Aug 25, 2014 at 2:14 AM, Lizhengbing (bing, BIPA) 
zhengbing...@huawei.com wrote:
Hi:
In the paper “Item-Based Top-N Recommendation 
Algorithms” (https://stuyresearch.googlecode.com/hg/blake/resources/10.1.1.102.4451.pdf),
there are two measures of recommendation quality: HR and ARHR.
If I use ALS (implicit) for a top-N recommendation system, I want to check its 
quality, and ARHR and HR are two good quality measures.
I want to contribute them to Spark MLlib, so I want to know whether this is 
meaningful.


(1) If n is the total number of customers/users, the hit-rate of the 
recommendation algorithm is computed as
hit-rate (HR) = Number of hits / n

(2) If h is the number of hits that occurred at positions p1, p2, . . . , ph 
within the top-N lists (i.e., 1 ≤ pi ≤ N), then the average reciprocal hit-rank 
is equal to:
ARHR = (1/n) * (1/p1 + 1/p2 + . . . + 1/ph)


Re: Contributing to MLlib: Proposal for Clustering Algorithms

2014-08-27 Thread RJ Nowling
Hi Yu,

A standardized API has not been implemented yet.  I think it would be
better to implement the other clustering algorithms first and then extract a
common API.  Others may feel differently.  :)

Just a note, there was a pre-existing JIRA for hierarchical KMeans,
SPARK-2429 https://issues.apache.org/jira/browse/SPARK-2429, which I filed.  I
added a comment about previous discussion on the mailing list, example code
provided by Jeremy Freeman, and a couple of papers I found.

Feel free to take this over -- I've played with it but haven't had time to
finish it.  I'd be happy to review the resulting code and discuss
approaches with you.

RJ



On Wed, Aug 13, 2014 at 9:20 AM, Yu Ishikawa yuu.ishikawa+sp...@gmail.com
wrote:

 Hi all,

 I am also interested in specifying a common framework.
 And I am trying to implement a hierarchical k-means and a hierarchical
 clustering method like single-link, using LSH.
 https://issues.apache.org/jira/browse/SPARK-2966

 If you have designed the standardized clustering algorithms API, please let
 me know.


 best,
 Yu Ishikawa








-- 
em rnowl...@gmail.com
c 954.496.2314


Re: Contributing to MLlib: Proposal for Clustering Algorithms

2014-08-27 Thread Jeremy Freeman
Hey RJ,

Sorry for the delay, I'd be happy to take a look at this if you can post the 
code!

I think splitting the largest cluster in each round is fairly common, but 
ideally it would be an option to do it one way or the other.
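
For reference, a minimal sketch of that approach -- bisecting k-means on top of
MLlib's KMeans, splitting the largest cluster in each round. This is not the gist's
code; the function name, the maxLeaves stopping rule, and the iteration count are
illustrative choices:

  import org.apache.spark.rdd.RDD
  import org.apache.spark.mllib.linalg.Vector
  import org.apache.spark.mllib.clustering.KMeans

  def bisectingKMeans(data: RDD[Vector], maxLeaves: Int): Seq[RDD[Vector]] = {
    var clusters: Seq[RDD[Vector]] = Seq(data)   // start with one cluster holding everything
    while (clusters.size < maxLeaves) {
      val largest = clusters.maxBy(_.count())    // split the largest remaining cluster each round
      val model = KMeans.train(largest, 2, 20)   // bisect it with k = 2
      val halves = (0 until 2).map(i => largest.filter(v => model.predict(v) == i))
      clusters = clusters.filterNot(_ eq largest) ++ halves
    }
    clusters
  }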

-- Jeremy

-
jeremy freeman, phd
neuroscientist
@thefreemanlab

On Aug 12, 2014, at 2:20 PM, RJ Nowling rnowl...@gmail.com wrote:

 Hi all,
 
 I wanted to follow up.
 
 I have a prototype for an optimized version of hierarchical k-means.  I
 wanted to get some feedback on my approach.
 
 Jeremy's implementation splits the largest cluster in each round.  Is it
 better to do it that way or to split each cluster in half?
 
 Are there any open-source examples that are being widely used in
 production?
 
 Thanks!
 
 
 
 On Fri, Jul 18, 2014 at 8:05 AM, RJ Nowling rnowl...@gmail.com wrote:
 
 Nice to meet you, Jeremy!
 
 This is great!  Hierarchical clustering was next on my list --
 currently trying to get my PR for MiniBatch KMeans accepted.
 
 If it's cool with you, I'll try converting your code to fit in with
 the existing MLLib code as you suggest. I also need to review the
 Decision Tree code (as suggested above) to see how much of that can be
 reused.
 
 Maybe I can ask you to do a code review for me when I'm done?
 
 
 
 
 
 On Thu, Jul 17, 2014 at 8:31 PM, Jeremy Freeman
 freeman.jer...@gmail.com wrote:
 Hi all,
 
 Cool discussion! I agree that a more standardized API for clustering, and
 easy access to underlying routines, would be useful (we've also been
 discussing this when trying to develop streaming clustering algorithms,
 similar to https://github.com/apache/spark/pull/1361)
 
 For divisive, hierarchical clustering I implemented something a while back;
 here's a gist.
 
 https://gist.github.com/freeman-lab/5947e7c53b368fe90371
 
 It does bisecting k-means clustering (with k=2), with a recursive class for
 keeping track of the tree. I also found this much better than agglomerative
 methods (for the reasons Hector points out).
 
 This needs to be cleaned up, and can surely be optimized (esp. by replacing
 the core KMeans step with existing MLLib code), but I can say I was running
 it successfully on quite large data sets.
 
 RJ, depending on where you are in your progress, I'd be happy to help work
 on this piece and / or have you use this as a jumping-off point, if useful.
 
 -- Jeremy
 
 
 
 
 
 
 --
 em rnowl...@gmail.com
 c 954.496.2314
 
 
 
 
 -- 
 em rnowl...@gmail.com
 c 954.496.2314



Re: Contributing to MLlib: Proposal for Clustering Algorithms

2014-08-27 Thread RJ Nowling
Thanks, Jeremy.  I'm abandoning my initial approach, and I'll work on
optimizing your example (so it doesn't do the breeze-vector conversions
every time KMeans is called).  I need to finish a few other projects first,
though, so it may be a couple weeks.

In the mean time, Yu also created a JIRA for a hierarchical KMeans
implementation.  I pointed him to your example and a couple papers I found.

If you or Yu beat me to getting an implementation in, I'd be happy to
review it.  :)


On Wed, Aug 27, 2014 at 12:18 PM, Jeremy Freeman freeman.jer...@gmail.com
wrote:

 Hey RJ,

 Sorry for the delay, I'd be happy to take a look at this if you can post
 the code!

 I think splitting the largest cluster in each round is fairly common, but
 ideally it would be an option to do it one way or the other.

 -- Jeremy

 -
 jeremy freeman, phd
 neuroscientist
 @thefreemanlab

 On Aug 12, 2014, at 2:20 PM, RJ Nowling rnowl...@gmail.com wrote:

 Hi all,

 I wanted to follow up.

 I have a prototype for an optimized version of hierarchical k-means.  I
 wanted to get some feedback on my approach.

 Jeremy's implementation splits the largest cluster in each round.  Is it
 better to do it that way or to split each cluster in half?

 Are there any open-source examples that are being widely used in
 production?

 Thanks!



 On Fri, Jul 18, 2014 at 8:05 AM, RJ Nowling rnowl...@gmail.com wrote:

 Nice to meet you, Jeremy!

 This is great!  Hierarchical clustering was next on my list --
 currently trying to get my PR for MiniBatch KMeans accepted.

 If it's cool with you, I'll try converting your code to fit in with
 the existing MLLib code as you suggest. I also need to review the
 Decision Tree code (as suggested above) to see how much of that can be
 reused.

 Maybe I can ask you to do a code review for me when I'm done?





 On Thu, Jul 17, 2014 at 8:31 PM, Jeremy Freeman
 freeman.jer...@gmail.com wrote:

 Hi all,

 Cool discussion! I agree that a more standardized API for clustering, and
 easy access to underlying routines, would be useful (we've also been
 discussing this when trying to develop streaming clustering algorithms,
 similar to https://github.com/apache/spark/pull/1361)

 For divisive, hierarchical clustering I implemented something a while back;
 here's a gist.

 https://gist.github.com/freeman-lab/5947e7c53b368fe90371

 It does bisecting k-means clustering (with k=2), with a recursive class for
 keeping track of the tree. I also found this much better than agglomerative
 methods (for the reasons Hector points out).

 This needs to be cleaned up, and can surely be optimized (esp. by replacing
 the core KMeans step with existing MLLib code), but I can say I was running
 it successfully on quite large data sets.

 RJ, depending on where you are in your progress, I'd be happy to help work
 on this piece and / or have you use this as a jumping-off point, if useful.


 -- Jeremy






 --
 em rnowl...@gmail.com
 c 954.496.2314




 --
 em rnowl...@gmail.com
 c 954.496.2314





-- 
em rnowl...@gmail.com
c 954.496.2314


Re: Adding support for a new object store

2014-08-27 Thread Reynold Xin
Hi Rajendran,

I'm assuming you have some concept of schema and you are intending to
integrate with SchemaRDD instead of normal RDDs.

More responses inline below.


On Fri, Aug 22, 2014 at 2:21 AM, Rajendran Appavu appra...@in.ibm.com
wrote:


  I am new to the Spark source code, and I am looking to see if I can add push-down
 support for Spark filters to the storage layer (in my case an object store). I am
 willing to consider how this can be done generically for any store that we might
 want to integrate with Spark. I would like to know the areas that I should look
 into to provide support for a new data store in this context. The following are
 some of the questions I have to start with:

  1. Do we need to create a new RDD class for the new store that we want to
 support? From the SparkContext we create an RDD, and the operations on data,
 including filter, are performed through the RDD methods.


You can create a new RDD type for a new storage system, and you can create
a new table scan operator in SQL to read from it.
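
For illustration, a minimal sketch of such an RDD; ObjectStoreClient, listChunks,
and readChunk are hypothetical stand-ins for the store's client, while the
getPartitions/compute plumbing is the standard pattern for extending RDD:

  import org.apache.spark.{Partition, SparkContext, TaskContext}
  import org.apache.spark.rdd.RDD

  // Hypothetical in-memory stand-in for an object-store client; a real client would
  // issue network calls and must be serializable (or be re-created per task).
  class ObjectStoreClient(data: Map[String, Seq[String]]) extends Serializable {
    def listChunks(bucket: String): Seq[String] = data.keys.toSeq
    def readChunk(bucket: String, chunk: String): Iterator[String] = data(chunk).iterator
  }

  case class ChunkPartition(index: Int, chunkId: String) extends Partition

  class ObjectStoreRDD(sc: SparkContext, store: ObjectStoreClient, bucket: String)
    extends RDD[String](sc, Nil) {

    // Driver side: one Spark partition per object-store chunk.
    override protected def getPartitions: Array[Partition] =
      store.listChunks(bucket).zipWithIndex
        .map { case (id, i) => ChunkPartition(i, id): Partition }.toArray

    // Executor side: stream the records of a single chunk.
    override def compute(split: Partition, context: TaskContext): Iterator[String] =
      store.readChunk(bucket, split.asInstanceOf[ChunkPartition].chunkId)
  }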


  2. When we specify the code for the filter task in the RDD.filter() method,
 how does it get communicated to the Executor on the data node? Does the Executor
 need to compile this code on the fly and execute it, or how does it work? (I have
 looked at the code for some time but have not yet figured this out, so I am
 looking for some pointers that can help me come up to speed in this part of the
 code.)


Right now the best way to do this is to hack the SQL strategies, which do
some predicate pushdown into the table scan:
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkStrategies.scala
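
To make the pushdown idea concrete, here is a purely illustrative sketch (not the
Catalyst strategy API): predicates the store can evaluate are attached to the scan
request, and the rest remain as Spark-side filters. All names below are hypothetical:

  // Split a filter into the part the object store evaluates server-side and the
  // residual part Spark applies after the scan.
  sealed trait Pred
  case class EqualTo(column: String, value: String) extends Pred
  case class Custom(f: Map[String, String] => Boolean) extends Pred

  def planScan(preds: Seq[Pred]): (Seq[Pred], Seq[Pred]) =
    preds.partition {
      case _: EqualTo => true    // simple equality can be pushed into the scan
      case _          => false   // everything else stays as rdd.filter(...) in Spark
    }

  val (pushed, residual) = planScan(Seq(EqualTo("year", "2014"), Custom(_.size > 3)))
  // `pushed` goes into the table-scan request; `residual` is evaluated by Spark.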

We are in the process of proposing an API that allows external data stores
to hook into the planner. Expect a design proposal in early/mid Sept.

Once that is in place, you wouldn't need to hack the planner anymore. It is
a good idea to start prototyping by hacking the planner, and migrate to the
planner hook API once that is ready.



  3. How long does the Executor hold the memory, and how does it decide when to
 release the memory/cache?


Executors by default don't actually hold any data in memory. Spark requires
explicit caching of data, i.e. only when rdd.cache() is called will Spark
executors put the contents of that RDD in memory. The executor has a component
called the BlockManager that does eviction based on LRU.
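
As a minimal sketch of that behavior (assuming an existing SparkContext `sc`; the
HDFS path is hypothetical):

  import org.apache.spark.storage.StorageLevel

  val records = sc.textFile("hdfs:///data/objects.txt")   // nothing is held in executor memory yet
  records.persist(StorageLevel.MEMORY_ONLY)               // equivalent to records.cache()
  records.count()                                         // the first action materializes the in-memory blocks
  records.filter(_.contains("error")).count()             // reuses the cached blocks
  records.unpersist()                                     // releases them explicitly; otherwise LRU eviction applies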




  Thank you in advance.





 Regards,
 Rajendran.






Re: RDD replication in Spark

2014-08-27 Thread Cheng Lian
You may start from here:
https://github.com/apache/spark/blob/4fa2fda88fc7beebb579ba808e400113b512533b/core/src/main/scala/org/apache/spark/storage/BlockManager.scala#L706-L712
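
For the user-facing side (as opposed to the internal peer-selection logic the link
above points at), a minimal sketch of requesting replication, assuming an existing
SparkContext `sc` and a hypothetical input path:

  import org.apache.spark.storage.StorageLevel

  val data = sc.textFile("hdfs:///some/input")    // hypothetical path
  data.persist(StorageLevel.MEMORY_ONLY_2)        // the "_2" storage levels request one extra replica
  data.count()                                    // materializing the blocks triggers their replication,
                                                  // which is handled inside BlockManager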


On Mon, Aug 25, 2014 at 9:05 PM, rapelly kartheek kartheek.m...@gmail.com
wrote:

 Hi,

  I've exercised the multiple options available for persist(), including RDD
 replication. I have gone through the classes involved in caching/storing
 the RDDs at different levels. The StorageLevel class plays a pivotal role by
 recording whether to use memory or disk, and whether to replicate the RDD on
 multiple nodes. The class LocationIterator iterates over the preferred machines
 one by one for each partition that is replicated. I have a rough idea of
 CoalescedRDD. Please correct me if I am wrong.

 But I am looking for the code that chooses the resources on which to replicate
 the RDDs. Can someone please tell me how replication takes place and how the
 resources for replication are chosen? I just want to know where I should look
 to understand how the replication happens.



 Thank you so much!!!

 regards

 -Karthik



Re: Handling stale PRs

2014-08-27 Thread Nicholas Chammas
On Tue, Aug 26, 2014 at 2:21 PM, Josh Rosen rosenvi...@gmail.com wrote:

 Last weekend, I started hacking on a Google App Engine app for helping
 with pull request review (screenshot: http://i.imgur.com/wwpZKYZ.png).


BTW Josh, how can we stay up-to-date on your work on this tool? A JIRA
issue, perhaps?

Nick


[GraphX] JIRA / PR to fix breakage in GraphGenerator.logNormalGraph in PR #720

2014-08-27 Thread RJ Nowling
Hi all,

 PR #720 https://github.com/apache/spark/pull/720 made multiple changes
to GraphGenerator.logNormalGraph including:

   - Replacing the calls to the functions for generating random vertices and
   edges with in-line implementations that use different equations. Based on
   reading the Pregel paper, I believe the in-line functions are incorrect.
   - Hard-coding of RNG seeds, so that the method now generates the same graph
   for a given number of vertices, edges, mu, and sigma -- the user is not able to
   override the seed or specify that it should be randomly generated.
   - A backwards-incompatible change to the logNormalGraph signature with the
   introduction of a new required parameter.
   - No update to the Scala docs or programming guide for the API changes.
   - Addition of a synthetic benchmark in the examples.

I submitted JIRA SPARK-3263
https://issues.apache.org/jira/browse/SPARK-3263 and PR #2168
https://github.com/apache/spark/pull/2168 to revert some of these changes
and fix usage of the RNGs:

   - Removes the in-line calls and calls original vertex / edge generation
   functions again
   - Adds an optional seed parameter for deterministic behavior (when
   desired)
   - Keeps the number of partitions parameter that was added.
   - Keeps compatibility with the synthetic benchmark example
   - Maintains backwards-compatible API

 I would appreciate feedback and people taking a look.  :)

Thanks!
RJ

-- 
em rnowl...@gmail.com
c 954.496.2314


Re: Adding support for a new object store

2014-08-27 Thread Reynold Xin
Linking to the JIRA tracking APIs to hook into the planner:
https://issues.apache.org/jira/browse/SPARK-3248




On Wed, Aug 27, 2014 at 1:56 PM, Reynold Xin r...@databricks.com wrote:

 Hi Rajendran,

 I'm assuming you have some concept of schema and you are intending to
 integrate with SchemaRDD instead of normal RDDs.

 More responses inline below.


 On Fri, Aug 22, 2014 at 2:21 AM, Rajendran Appavu appra...@in.ibm.com
 wrote:


  I am new to the Spark source code, and I am looking to see if I can add push-down
 support for Spark filters to the storage layer (in my case an object store). I am
 willing to consider how this can be done generically for any store that we might
 want to integrate with Spark. I would like to know the areas that I should look
 into to provide support for a new data store in this context. The following are
 some of the questions I have to start with:

  1. Do we need to create a new RDD class for the new store that we want to
 support? From the SparkContext we create an RDD, and the operations on data,
 including filter, are performed through the RDD methods.


 You can create a new RDD type for a new storage system, and you can create
 a new table scan operator in SQL to read from it.


  2. When we specify the code for the filter task in the RDD.filter() method,
 how does it get communicated to the Executor on the data node? Does the Executor
 need to compile this code on the fly and execute it, or how does it work? (I have
 looked at the code for some time but have not yet figured this out, so I am
 looking for some pointers that can help me come up to speed in this part of the
 code.)


 Right now the best way to do this is to hack the SQL strategies, which do
 some predicate pushdown into the table scan:
 https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkStrategies.scala

 We are in the process of proposing an API that allows external data stores
 to hook into the planner. Expect a design proposal in early/mid Sept.

 Once that is in place, you wouldn't need to hack the planner anymore. It
 is a good idea to start prototyping by hacking the planner, and migrate to
 the planner hook API once that is ready.



  3. How long does the Executor hold the memory, and how does it decide when to
 release the memory/cache?


 Executors by default don't actually hold any data in memory. Spark requires
 explicit caching of data, i.e. only when rdd.cache() is called will Spark
 executors put the contents of that RDD in memory. The executor has a component
 called the BlockManager that does eviction based on LRU.




  Thank you in advance.





 Regards,
 Rajendran.







Re: Handling stale PRs

2014-08-27 Thread Nishkam Ravi
Wonder if it would make sense to introduce a notion of 'Reviewers' as an
intermediate tier to help distribute the load? While anyone can review and
comment on an open PR, reviewers would be able to say aye or nay subject to
confirmation by a committer?

Thanks,
Nishkam


On Wed, Aug 27, 2014 at 2:11 PM, Nicholas Chammas 
nicholas.cham...@gmail.com wrote:

 On Tue, Aug 26, 2014 at 2:21 PM, Josh Rosen rosenvi...@gmail.com wrote:

  Last weekend, I started hacking on a Google App Engine app for helping
  with pull request review (screenshot: http://i.imgur.com/wwpZKYZ.png).
 

 BTW Josh, how can we stay up-to-date on your work on this tool? A JIRA
 issue, perhaps?

 Nick



Re: HiveContext, schemaRDD.printSchema get different dataTypes, feature or a bug? really strange and surprised...

2014-08-27 Thread Cheng Lian
I believe in your case, the “magic” happens in TableReader.fillObject
https://github.com/apache/spark/blob/4fa2fda88fc7beebb579ba808e400113b512533b/core/src/main/scala/org/apache/spark/storage/BlockManager.scala#L706-L712.
Here we unwrap the field value according to the object inspector of that
field. It seems that somehow a FloatObjectInspector is specified for the
total_price field. I don’t think CSVSerde is responsible for this, since it
sets all field object inspectors to javaStringObjectInspector (here
https://github.com/ogrodnek/csv-serde/blob/f315c1ae4b21a8288eb939e7c10f3b29c1a854ef/src/main/java/com/bizo/hive/serde/csv/CSVSerde.java#L59-L61
).

Which version of Spark SQL are you using? If you are using a snapshot
version, please provide the exact Git commit hash. Thanks!


On Tue, Aug 26, 2014 at 8:29 AM, chutium teng@gmail.com wrote:

 oops, I tried on a managed table; the column types will not be changed,

 so it is mostly due to the SerDe lib CSVSerDe
 (
 https://github.com/ogrodnek/csv-serde/blob/master/src/main/java/com/bizo/hive/serde/csv/CSVSerde.java#L123
 )
 or maybe the CSVReader from opencsv?...

 But if the columns are defined as string, no matter what type is returned from
 the custom SerDe or CSVReader, they should be cast to string at the end, right?

 Why not use the schema from the Hive metadata directly?








Re: Handling stale PRs

2014-08-27 Thread Patrick Wendell
Hey Nishkam,

To some extent we already have this process - many community members
help review patches and some earn a reputation where committers will
take an LGTM from them seriously. I'd be interested in seeing if any
other projects recognize people who do this.

- Patrick

On Wed, Aug 27, 2014 at 2:36 PM, Nishkam Ravi nr...@cloudera.com wrote:
 Wonder if it would make sense to introduce a notion of 'Reviewers' as an
 intermediate tier to help distribute the load? While anyone can review and
 comment on an open PR, reviewers would be able to say aye or nay subject to
 confirmation by a committer?

 Thanks,
 Nishkam


 On Wed, Aug 27, 2014 at 2:11 PM, Nicholas Chammas
 nicholas.cham...@gmail.com wrote:

 On Tue, Aug 26, 2014 at 2:21 PM, Josh Rosen rosenvi...@gmail.com wrote:

  Last weekend, I started hacking on a Google App Engine app for helping
  with pull request review (screenshot: http://i.imgur.com/wwpZKYZ.png).
 

 BTW Josh, how can we stay up-to-date on your work on this tool? A JIRA
 issue, perhaps?

 Nick






Re: Handling stale PRs

2014-08-27 Thread Josh Rosen
I have a very simple dashboard running at http://spark-prs.appspot.com/.  
Currently, this mirrors the functionality of Patrick’s github-shim, but it 
should be very easy to extend with other features.

The source is at https://github.com/databricks/spark-pr-dashboard (pull 
requests and issues welcome!)

On August 27, 2014 at 2:11:41 PM, Nicholas Chammas (nicholas.cham...@gmail.com) 
wrote:

On Tue, Aug 26, 2014 at 2:21 PM, Josh Rosen rosenvi...@gmail.com wrote:
Last weekend, I started hacking on a Google App Engine app for helping with 
pull request review (screenshot: http://i.imgur.com/wwpZKYZ.png).

BTW Josh, how can we stay up-to-date on your work on this tool? A JIRA issue, 
perhaps?

Nick

Re: Handling stale PRs

2014-08-27 Thread Nicholas Chammas
Alright! That was quick. :)


On Wed, Aug 27, 2014 at 6:48 PM, Josh Rosen rosenvi...@gmail.com wrote:

 I have a very simple dashboard running at http://spark-prs.appspot.com/.
  Currently, this mirrors the functionality of Patrick’s github-shim, but it
 should be very easy to extend with other features.

 The source is at https://github.com/databricks/spark-pr-dashboard (pull
 requests and issues welcome!)

 On August 27, 2014 at 2:11:41 PM, Nicholas Chammas (
 nicholas.cham...@gmail.com) wrote:

  On Tue, Aug 26, 2014 at 2:21 PM, Josh Rosen rosenvi...@gmail.com wrote:

  Last weekend, I started hacking on a Google App Engine app for helping
 with pull request review (screenshot: http://i.imgur.com/wwpZKYZ.png).


 BTW Josh, how can we stay up-to-date on your work on this tool? A JIRA
 issue, perhaps?

 Nick




Re: jenkins maintenance/downtime, aug 28th, 730am-9am PDT

2014-08-27 Thread Nicholas Chammas
Looks like we're currently at 1.568 so we should be getting a nice slew of
UI tweaks and bug fixes. Neat!


On Wed, Aug 27, 2014 at 7:13 PM, shane knapp skn...@berkeley.edu wrote:

 tomorrow morning i will be upgrading jenkins to the latest/greatest
 (1.577).

 at 730am, i will put jenkins in to a quiet period, so no new builds will be
 accepted.  once any running builds are finished, i will be taking jenkins
 down for the upgrade.

 depending on what and how many jobs are running, i'm expecting this to
 take, at most, an hour.

 i'll send out an update tomorrow morning right before i begin, and will
 send out updates and an all-clear once we're up and running again.

 1.577 release notes:
 http://jenkins-ci.org/changelog

 please let me know if there are any questions/concerns.  thanks in advance!

 shane



Re: Handling stale PRs

2014-08-27 Thread Nishkam Ravi
I see. Yeah, it would be interesting to know if any other project has
considered formalizing this notion. It may also enable assignment of
reviews (potentially automated using Josh's system) and maybe anonymity as
well? On the downside, it isn't easily implemented and probably doesn't
come without certain undesired side-effects.

Thanks,
Nishkam


On Wed, Aug 27, 2014 at 3:39 PM, Patrick Wendell pwend...@gmail.com wrote:

 Hey Nishkam,

 To some extent we already have this process - many community members
 help review patches and some earn a reputation where committers will
 take an LGTM from them seriously. I'd be interested in seeing if any
 other projects recognize people who do this.

 - Patrick

 On Wed, Aug 27, 2014 at 2:36 PM, Nishkam Ravi nr...@cloudera.com wrote:
  Wonder if it would make sense to introduce a notion of 'Reviewers' as an
  intermediate tier to help distribute the load? While anyone can review
 and
  comment on an open PR, reviewers would be able to say aye or nay subject
 to
  confirmation by a committer?
 
  Thanks,
  Nishkam
 
 
  On Wed, Aug 27, 2014 at 2:11 PM, Nicholas Chammas
  nicholas.cham...@gmail.com wrote:
 
  On Tue, Aug 26, 2014 at 2:21 PM, Josh Rosen rosenvi...@gmail.com
 wrote:
 
   Last weekend, I started hacking on a Google App Engine app for helping
   with pull request review (screenshot: http://i.imgur.com/wwpZKYZ.png
 ).
  
 
  BTW Josh, how can we stay up-to-date on your work on this tool? A JIRA
  issue, perhaps?
 
  Nick
 
 



Update on Pig on Spark initiative

2014-08-27 Thread Mayur Rustagi
Hi,
We have migrated Pig functionality on top of Spark, passing 100% e2e for
success cases in the Pig test suite. That means UDFs, joins, and other
functionality are working quite nicely. We are in the process of merging
with Apache Pig trunk (something that should happen over the next 2 weeks).
Meanwhile, if you are interested in giving it a go, you can try it at
https://github.com/sigmoidanalytics/spork
This contains all the major changes but may not have all the patches
required for 100% e2e; if you are trying it out, let me know about any issues you
face.

A whole bunch of folks contributed to this:

Julien Le Dem (Twitter),  Praveen R (Sigmoid Analytics), Akhil Das (Sigmoid
Analytics), Bill Graham (Twitter), Dmitriy Ryaboy (Twitter), Kamal Banga
(Sigmoid Analytics), Anish Haldiya (Sigmoid Analytics),  Aniket Mokashi
 (Google), Greg Owen (DataBricks), Amit Kumar Behera (Sigmoid Analytics),
Mahesh Kalakoti (Sigmoid Analytics)

Not to mention the Spark and Pig communities.

Regards
Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi https://twitter.com/mayur_rustagi


Re: Update on Pig on Spark initiative

2014-08-27 Thread Matei Zaharia
Awesome to hear this, Mayur! Thanks for putting this together.

Matei

On August 27, 2014 at 10:04:12 PM, Mayur Rustagi (mayur.rust...@gmail.com) 
wrote:

Hi,
We have migrated Pig functionality on top of Spark, passing 100% e2e for success 
cases in the Pig test suite. That means UDFs, joins, and other functionality are working 
quite nicely. We are in the process of merging with Apache Pig trunk (something 
that should happen over the next 2 weeks). 
Meanwhile, if you are interested in giving it a go, you can try it at 
https://github.com/sigmoidanalytics/spork
This contains all the major changes but may not have all the patches required 
for 100% e2e; if you are trying it out, let me know about any issues you face.

A whole bunch of folks contributed to this: 

Julien Le Dem (Twitter),  Praveen R (Sigmoid Analytics), Akhil Das (Sigmoid 
Analytics), Bill Graham (Twitter), Dmitriy Ryaboy (Twitter), Kamal Banga 
(Sigmoid Analytics), Anish Haldiya (Sigmoid Analytics),  Aniket Mokashi  
(Google), Greg Owen (DataBricks), Amit Kumar Behera (Sigmoid Analytics), Mahesh 
Kalakoti (Sigmoid Analytics)

Not to mention the Spark and Pig communities. 

Regards
Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi



[Spark SQL] query nested structure data

2014-08-27 Thread wenchen
I am going to dig into this issue:
https://issues.apache.org/jira/browse/SPARK-2096
However, I noticed that there is already a NestedSqlParser in sql/core/test,
in org.apache.spark.sql.parquet. 
I checked this parser and it could solve the issue I mentioned before. But
why did the author of the parser mark it as temporary? Does this parser break
some Spark SQL grammar that I haven't noticed?



