[GitHub] spark issue #14064: [SPARK-15968][SQL] Nonempty partitioned metastore tables...

2016-07-06 Thread mallman
Github user mallman commented on the issue:

https://github.com/apache/spark/pull/14064
  
@cloud-fan Muchas gracias!





[GitHub] spark issue #13818: [SPARK-15968][SQL] Nonempty partitioned metastore tables...

2016-07-05 Thread mallman
Github user mallman commented on the issue:

https://github.com/apache/spark/pull/13818
  
> I have a few questions.
> 
> Is it a regression from 1.6? Looks like not?

I don't know about 1.6. I know it's a regression from 1.5.

> Is it a correctness issue or a performance issue? Seems it is a
> performance issue?

It is a performance issue.

> If it is a performance issue. What is the impact? For a hive parquet/orc
> table, after we convert them to Spark's native code path, there is no
> partitioning discovery. So, I guess the performance is mainly coming from
> querying metastore? If so, what will be the perf difference after
> spark.sql.hive.metastorePartitionPruning (only querying needed partition info
> from Hive metastore) is enabled?

The problem this PR addresses occurs in the analysis phase of query 
planning. The property `spark.sql.hive.metastorePartitionPruning` only comes 
into play in `HiveTableScanExec`, which is part of physical planning. (And I 
don't believe it's used to read Parquet tables.) Therefore, that property has 
no bearing on this problem.

Regarding the impact, I'll quote from the last paragraph of the PR 
description:

> Building a large HadoopFsRelation requires stat-ing all of its data
> files. In our environment, where we have tables with 10's of thousands of
> partitions, the difference between using a cached relation versus a new one is
> a matter of seconds versus minutes. Caching partitioned table metadata vastly
> improves the usability of Spark SQL for these cases.



> My feeling is that if it is a perf issue and it is not a regression from
> 1.6, merging to master should be good enough.

For some (like us), I'd say this extends beyond a performance issue into a 
usability issue. We can't use Spark 2.0 as-is if it takes us several minutes to 
build a query plan.





[GitHub] spark issue #13818: [SPARK-15968][SQL] Nonempty partitioned metastore tables...

2016-07-05 Thread mallman
Github user mallman commented on the issue:

https://github.com/apache/spark/pull/13818
  
@zsxwing I was able to do the following without error:

git clone g...@github.com:apache/spark.git spark-master
cd spark-master
./dev/change-scala-version.sh 2.10
./build/mvn -Pyarn -Phadoop-2.4 -Dscala-2.10 -DskipTests clean package





[GitHub] spark issue #14031: [SPARK-16353][BUILD][DOC] Missing javadoc options for ja...

2016-07-04 Thread mallman
Github user mallman commented on the issue:

https://github.com/apache/spark/pull/14031
  
Thank you.





[GitHub] spark issue #13818: [SPARK-15968][SQL] Nonempty partitioned metastore tables...

2016-07-04 Thread mallman
Github user mallman commented on the issue:

https://github.com/apache/spark/pull/13818
  
I believe I've addressed @liancheng's style issues in my new unit test, 
along with the same in the two tests from which it was copy-pasta'd (boy scout 
rule). Hopefully I didn't cock it up.





[GitHub] spark pull request #14031: [SPARK-16353][BUILD][DOC] Missing javadoc options...

2016-07-04 Thread mallman
Github user mallman commented on a diff in the pull request:

https://github.com/apache/spark/pull/14031#discussion_r69479805
  
--- Diff: project/SparkBuild.scala ---
@@ -723,8 +723,8 @@ object Unidoc {
 .map(_.filterNot(_.getCanonicalPath.contains("org/apache/hadoop")))
 },
 
-// Javadoc options: create a window title, and group key packages on 
index page
-javacOptions in doc := Seq(
+// Javadoc options: create a window title
--- End diff --

I removed the comment entirely.





[GitHub] spark pull request #14031: [SPARK-16353][BUILD][DOC] Missing javadoc options...

2016-07-02 Thread mallman
Github user mallman commented on a diff in the pull request:

https://github.com/apache/spark/pull/14031#discussion_r69385673
  
--- Diff: project/SparkBuild.scala ---
@@ -723,8 +723,8 @@ object Unidoc {
 .map(_.filterNot(_.getCanonicalPath.contains("org/apache/hadoop")))
 },
 
-// Javadoc options: create a window title, and group key packages on 
index page
-javacOptions in doc := Seq(
+// Javadoc options: create a window title
--- End diff --

I think we can either change it to just `// Javadoc options` to clarify 
that the following `javacOptions` are in fact for Javadoc, or we can remove the 
comment entirely.





[GitHub] spark pull request #14031: [SPARK-16353][BUILD][DOC] Missing javadoc options...

2016-07-02 Thread mallman
Github user mallman commented on a diff in the pull request:

https://github.com/apache/spark/pull/14031#discussion_r69382212
  
--- Diff: project/SparkBuild.scala ---
@@ -723,8 +723,8 @@ object Unidoc {
 .map(_.filterNot(_.getCanonicalPath.contains("org/apache/hadoop")))
 },
 
-// Javadoc options: create a window title, and group key packages on 
index page
--- End diff --

BTW, I removed the mention of package groupings because none are defined.





[GitHub] spark pull request #14031: [SPARK-16353][BUILD][DOC] Missing javadoc options...

2016-07-02 Thread mallman
GitHub user mallman opened a pull request:

https://github.com/apache/spark/pull/14031

[SPARK-16353][BUILD][DOC] Missing javadoc options for java unidoc

## What changes were proposed in this pull request?

The javadoc options for the java unidoc generation are ignored when 
generating the java unidoc. For example, the generated `index.html` has the 
wrong HTML page title. This can be seen at 
http://spark.apache.org/docs/latest/api/java/index.html.

I changed the relevant setting scope from `doc` to `(JavaUnidoc, unidoc)`.
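
Roughly, the change amounts to the following sbt scoping (a sketch; the window
title string here is illustrative and the real option list in `SparkBuild.scala`
is longer):

    // before: scoped to the plain `doc` task, which the unidoc task does not consult
    javacOptions in doc := Seq("-windowtitle", "Spark JavaDoc")

    // after: scoped to the java unidoc task, so the options actually take effect
    javacOptions in (JavaUnidoc, unidoc) := Seq("-windowtitle", "Spark JavaDoc")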

## How was this patch tested?

I ran `docs/jekyll build` and verified that the java unidoc `index.html` 
has the correct HTML page title.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/VideoAmp/spark-public spark-16353

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/14031.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #14031


commit 939e8b5d3a3b502f3a7870d437cb38ee9564e6c4
Author: Michael Allman 
Date:   2016-07-02T19:55:39Z

[SPARK-16353][BUILD][DOC] The javadoc options for the java unidoc
generation are not honored. The scope of the relevant javacOptions key
should be `(JavaUnidoc, unidoc)` not `doc`







[GitHub] spark pull request #13818: [SPARK-15968][SQL] Nonempty partitioned metastore...

2016-06-30 Thread mallman
Github user mallman commented on a diff in the pull request:

https://github.com/apache/spark/pull/13818#discussion_r69233431
  
--- Diff: 
sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala ---
@@ -265,9 +265,12 @@ private[hive] class HiveMetastoreCatalog(sparkSession: 
SparkSession) extends Log
 PartitionDirectory(values, location)
   }
   val partitionSpec = PartitionSpec(partitionSchema, partitions)
+  val partitionPaths = partitions.map(_.path.toString)
+  val paths = partitionPaths.padTo(1, 
metastoreRelation.hiveQlTable.getDataLocation.toString)
--- End diff --

Done.





[GitHub] spark issue #13818: [SPARK-15968][SQL] Nonempty partitioned metastore tables...

2016-06-30 Thread mallman
Github user mallman commented on the issue:

https://github.com/apache/spark/pull/13818
  
You are very welcome. Thank you for taking time to review it! 😃 





[GitHub] spark pull request #13818: [SPARK-15968][SQL] Nonempty partitioned metastore...

2016-06-30 Thread mallman
Github user mallman commented on a diff in the pull request:

https://github.com/apache/spark/pull/13818#discussion_r69231754
  
--- Diff: 
sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/InsertIntoHiveTable.scala
 ---
@@ -298,6 +298,7 @@ case class InsertIntoHiveTable(
 
 // Invalidate the cache.
 sqlContext.sharedState.cacheManager.invalidateCache(table)
+
sqlContext.sessionState.catalog.invalidateTable(table.catalogTable.identifier)
--- End diff --

Essentially this is a "callback" to the session catalog to invalidate this 
table in the catalog's table cache, because we just appended to the underlying 
table data. In the context of a Hive query, the session catalog's table cache 
is `HiveMetastoreCatalog.cachedDataSourceTables`. The results of the `INSERT` 
will be invisible in the current session until the table is invalidated.

Another way to think about this code is that it's precisely what makes the 
following test snippet work: 
https://github.com/VideoAmp/spark-public/blob/spark-15968/sql/hive/src/test/scala/org/apache/spark/sql/hive/parquetSuites.scala#L582-L585
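
To illustrate, here is a minimal spark-shell sketch (the table name and values
are made up, not taken from that test):

    spark.sql("CREATE TABLE part_tbl (value INT) PARTITIONED BY (p INT) STORED AS PARQUET")
    spark.sql("INSERT INTO TABLE part_tbl PARTITION (p = 1) SELECT 42")
    // because the INSERT invalidates the cached relation, this read sees the new row
    spark.sql("SELECT value FROM part_tbl WHERE p = 1").show()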

Does that answer your question?





[GitHub] spark pull request #13818: [SPARK-15968][SQL] Nonempty partitioned metastore...

2016-06-30 Thread mallman
Github user mallman commented on a diff in the pull request:

https://github.com/apache/spark/pull/13818#discussion_r69230833
  
--- Diff: 
sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala ---
@@ -265,9 +265,12 @@ private[hive] class HiveMetastoreCatalog(sparkSession: 
SparkSession) extends Log
 PartitionDirectory(values, location)
   }
   val partitionSpec = PartitionSpec(partitionSchema, partitions)
+  val partitionPaths = partitions.map(_.path.toString)
+  val paths = partitionPaths.padTo(1, 
metastoreRelation.hiveQlTable.getDataLocation.toString)
--- End diff --

It is weird but correct.

Essentially, the expression

    partitionPaths.padTo(1, metastoreRelation.hiveQlTable.getDataLocation.toString)

will return `partitionPaths` if `partitionPaths` is nonempty and 
`metastoreRelation.hiveQlTable.getDataLocation.toString` otherwise. This fits 
the logic that seems to be present wherever partitioned tables are handled in 
the Spark codebase: use the table base location for empty partitioned tables, 
and use the partition data locations for nonempty partitioned tables. More 
specifically, the base table path is _not_ included in the latter case.

The expression you suggest will always return the table base location as 
one of the table data paths, whether `partitionPaths` is empty or not.
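
For anyone unfamiliar with `padTo`, a minimal sketch of the behavior described
above (the paths are made up):

    val default = "hdfs://nn/warehouse/tbl"                   // table base location
    Seq.empty[String].padTo(1, default)                       // Seq("hdfs://nn/warehouse/tbl")
    Seq(s"$default/p=1", s"$default/p=2").padTo(1, default)   // unchanged: Seq(".../p=1", ".../p=2")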





[GitHub] spark pull request #13818: [SPARK-15968][SQL] Nonempty partitioned metastore...

2016-06-30 Thread mallman
Github user mallman commented on a diff in the pull request:

https://github.com/apache/spark/pull/13818#discussion_r69159655
  
--- Diff: 
sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala ---
@@ -200,7 +201,6 @@ private[hive] class HiveMetastoreCatalog(sparkSession: 
SparkSession) extends Log
 cachedDataSourceTables.getIfPresent(tableIdentifier) match {
   case null => None // Cache miss
   case logical @ LogicalRelation(relation: HadoopFsRelation, _, _) =>
-val pathsInMetastore = 
metastoreRelation.catalogTable.storage.locationUri.toSeq
--- End diff --

This is where the relation's paths are computed and the logic for empty 
versus non-empty partitioned tables diverges: 
https://github.com/VideoAmp/spark-public/blob/8a058c65c6c20e311bde5c0ade87c14c6b6b5f37/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala#L489-L493.

I believe this is the PR where this behavior was introduced: #13022.





[GitHub] spark pull request #13818: [SPARK-15968][SQL] Nonempty partitioned metastore...

2016-06-30 Thread mallman
Github user mallman commented on a diff in the pull request:

https://github.com/apache/spark/pull/13818#discussion_r69107723
  
--- Diff: 
sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala ---
@@ -191,6 +191,7 @@ private[hive] class HiveMetastoreCatalog(sparkSession: 
SparkSession) extends Log
 
   private def getCached(
   tableIdentifier: QualifiedTableName,
+  pathsInMetastore: Seq[String],
--- End diff --

While the following commit does not remove that argument, it appears to be 
the one that changes the behavior for how partitioned tables are looked up in 
the cache:


https://github.com/apache/spark/commit/e720dda42e806229ccfd970055c7b8a93eb447bf#diff-ee66e11b56c21364760a5ed2b783f863R508

I'm not sure how to find a PR in GitHub associated with a given commit.





[GitHub] spark pull request #13818: [SPARK-15968][SQL] Nonempty partitioned metastore...

2016-06-30 Thread mallman
Github user mallman commented on a diff in the pull request:

https://github.com/apache/spark/pull/13818#discussion_r69106546
  
--- Diff: 
sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala ---
@@ -200,7 +201,6 @@ private[hive] class HiveMetastoreCatalog(sparkSession: 
SparkSession) extends Log
 cachedDataSourceTables.getIfPresent(tableIdentifier) match {
   case null => None // Cache miss
   case logical @ LogicalRelation(relation: HadoopFsRelation, _, _) =>
-val pathsInMetastore = 
metastoreRelation.catalogTable.storage.locationUri.toSeq
--- End diff --

metastoreRelation.catalogTable.storage.locationUri.toSeq

returns the base path of the relation. This is then compared to 
`relation.location.paths` to validate the cached entry. For non-empty 
partitioned tables (by that I mean partitioned tables with one or more 
metastore partitions), `relation.location.paths` returns the locations of the 
partitions. Hence, these values will never be equal and `useCached` will always 
be `false`.
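
A made-up illustration of the mismatch (the comparison in `getCached` is
sketched here as simple set equality):

    val pathsInMetastore = Seq("hdfs://nn/warehouse/tbl")         // base path only
    val pathsInRelation  = Seq("hdfs://nn/warehouse/tbl/p=1",     // relation.location.paths
                               "hdfs://nn/warehouse/tbl/p=2")
    val useCached = pathsInMetastore.toSet == pathsInRelation.toSet  // always false here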





[GitHub] spark pull request #13818: [SPARK-15968][SQL] Nonempty partitioned metastore...

2016-06-22 Thread mallman
Github user mallman commented on a diff in the pull request:

https://github.com/apache/spark/pull/13818#discussion_r68064203
  
--- Diff: 
sql/hive/src/test/scala/org/apache/spark/sql/hive/parquetSuites.scala ---
@@ -425,6 +425,28 @@ class ParquetMetastoreSuite extends 
ParquetPartitioningTest {
 }
   }
 
+  test("SPARK-15968: nonempty partitioned metastore Parquet table lookup 
should use cached " +
--- End diff --

I looked in `CachedTableSuite`. I'm not sure that's a good place for this 
kind of test. That test suite seems focused on testing tables cached by the 
[CacheManager](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/CacheManager.scala).
 This patch is focused on table caching in 
[HiveMetastoreCatalog](https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala).

It's difficult to find the best place for these kinds of caching tests. I 
chose this file because it already had some of these tests. Perhaps 
[HiveMetastoreCatalogSuite](https://github.com/apache/spark/blob/master/sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveMetastoreCatalogSuite.scala)
 would be a good candidate for an alternative?





[GitHub] spark issue #13818: [SPARK-15968][SQL] Nonempty partitioned metastore tables...

2016-06-21 Thread mallman
Github user mallman commented on the issue:

https://github.com/apache/spark/pull/13818
  
@hvanhovell I'm mentioning you here because you commented on my previous PR 
for this Jira issue. In response to your original question, yes, I have added a 
unit test for this patch.





[GitHub] spark pull request #13818: [SPARK-15968][SQL] Nonempty partitioned metastore...

2016-06-21 Thread mallman
GitHub user mallman opened a pull request:

https://github.com/apache/spark/pull/13818

[SPARK-15968][SQL] Nonempty partitioned metastore tables are not cached

(Please note this is a revision of PR #13686, which has been closed in 
favor of this PR.)

## What changes were proposed in this pull request?

The `getCached` method of `HiveMetastoreCatalog` computes 
`pathsInMetastore` from the metastore relation's catalog table. This only 
returns the table base path, which is incomplete/inaccurate for a nonempty 
partitioned table. As a result, cached lookups on nonempty partitioned tables 
always miss.

Rather than get `pathsInMetastore` from

metastoreRelation.catalogTable.storage.locationUri.toSeq

I modified the `getCached` method to take a `pathsInMetastore` argument. 
Calls to this method pass in the paths computed from calls to the Hive 
metastore. This is how `getCached` was implemented in Spark 1.5:


https://github.com/apache/spark/blob/e0c3212a9b42e3e704b070da4ac25b68c584427f/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala#L444.

I also added a call in `InsertIntoHiveTable.scala` to invalidate the table 
from the SQL session catalog.

## How was this patch tested?

I've added a new unit test to `parquetSuites.scala`:

SPARK-15968: nonempty partitioned metastore Parquet table lookup should 
use cached relation

Note that the only difference between this new test and the one above it in 
the file is that the new test populates its partitioned table with a single 
value, while the existing test leaves the table empty. This reveals a subtle, 
unexpected hole in test coverage present before this patch.

Note I also modified a different but related unit test in 
`parquetSuites.scala`:

SPARK-15248: explicitly added partitions should be readable

This unit test asserts that Spark SQL should return data from a table 
partition which has been placed there outside a metastore query immediately 
after it is added. I changed the test so that, instead of adding the data as a 
parquet file saved in the partition's location, the data is added through a SQL 
`INSERT` query. I made this change because I could find no way to efficiently 
support partitioned table caching without failing that test.

In addition to my primary motivation, I can offer a few reasons I believe 
this is an acceptable weakening of that test. First, it still validates a fix 
for [SPARK-15248](https://issues.apache.org/jira/browse/SPARK-15248), the issue 
for which it was written. Second, the assertion made is stronger than that 
required for non-partitioned tables. If you write data to the storage location 
of a non-partitioned metastore table without using a proper SQL DML query, a 
subsequent call to show that data will not return it. I believe this is an 
intentional limitation put in place to make table caching feasible, but I'm 
only speculating.

Building a large `HadoopFsRelation` requires `stat`-ing all of its data 
files. In our environment, where we have tables with 10's of thousands of 
partitions, the difference between using a cached relation versus a new one is 
a matter of seconds versus minutes. Caching partitioned table metadata vastly 
improves the usability of Spark SQL in these cases.

Thanks.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/VideoAmp/spark-public spark-15968

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/13818.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #13818


commit 8a058c65c6c20e311bde5c0ade87c14c6b6b5f37
Author: Michael Allman 
Date:   2016-06-15T16:52:17Z

[SPARK-15968][SQL] HiveMetastoreCatalog does not correctly validate
partitioned metastore relation when searching the internal table cache

The `getCached` method of `HiveMetastoreCatalog` computes
`pathsInMetastore` from the metastore relation's catalog table. This
only returns the table base path, which is not correct for nonempty
partitioned tables. As a result, cached lookups on nonempty partitioned
tables always miss.







[GitHub] spark pull request #13686: [SPARK-15968][SQL] HiveMetastoreCatalog does not ...

2016-06-15 Thread mallman
Github user mallman closed the pull request at:

https://github.com/apache/spark/pull/13686





[GitHub] spark issue #13686: [SPARK-15968][SQL] HiveMetastoreCatalog does not correct...

2016-06-15 Thread mallman
Github user mallman commented on the issue:

https://github.com/apache/spark/pull/13686
  
I'm going to close this PR and open a new one when I've fixed the test 
failures. My bad.





[GitHub] spark issue #13686: [SPARK-15968][SQL] HiveMetastoreCatalog does not correct...

2016-06-15 Thread mallman
Github user mallman commented on the issue:

https://github.com/apache/spark/pull/13686
  
Aaaak! Some unit tests are failing on my build. Sorry, I will re-examine 
and submit a new commit. Ugh.





[GitHub] spark issue #13686: [SPARK-15968][SQL] HiveMetastoreCatalog does not correct...

2016-06-15 Thread mallman
Github user mallman commented on the issue:

https://github.com/apache/spark/pull/13686
  
Actually, let me think about this some more...





[GitHub] spark issue #13686: [SPARK-15968][SQL] HiveMetastoreCatalog does not correct...

2016-06-15 Thread mallman
Github user mallman commented on the issue:

https://github.com/apache/spark/pull/13686
  
@hvanhovell Sounds like a good idea, but I don't know how to unit test this 
without opening up some of this caching api to at least the `private[hive]` 
access level. Would that be acceptable? I'm open to suggestions.





[GitHub] spark pull request #13686: [SPARK-15968][SQL] HiveMetastoreCatalog does not ...

2016-06-15 Thread mallman
GitHub user mallman opened a pull request:

https://github.com/apache/spark/pull/13686

[SPARK-15968][SQL] HiveMetastoreCatalog does not correctly validate

## What changes were proposed in this pull request?

The `getCached` method of `HiveMetastoreCatalog` computes 
`pathsInMetastore` from the metastore relation's catalog table. This only 
returns the table base path, which is not correct for partitioned
tables. As a result, cached lookups on partitioned tables always miss, and 
these relations are always recomputed.

Rather than get `pathsInMetastore` from

metastoreRelation.catalogTable.storage.locationUri.toSeq

I modified the `getCached` method to take a `pathsInMetastore` argument. 
Calls to this method pass in the paths computed from calls to the Hive 
metastore. This is how `getCached` was implemented in Spark 1.5:


https://github.com/apache/spark/blob/e0c3212a9b42e3e704b070da4ac25b68c584427f/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala#L444.

## How was this patch tested?

I tested by (temporarily) adding logging to the `getCached` method and ran 
`spark.table("...")` on a partitioned table in a spark-shell before and after 
this patch. Before this patch, the value of `useCached` in `getCached` was 
`false`. After the patch it was `true`. I also validated that caching still 
works for unpartitioned tables.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/VideoAmp/spark-public spark-15968

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/13686.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #13686


commit 60bfe10e350d245632e940aa758cec4f0d2c4006
Author: Michael Allman 
Date:   2016-06-15T16:52:17Z

[SPARK-15968][SQL] HiveMetastoreCatalog does not correctly validate
partitioned metastore relation when searching the internal table cache

The `getCached` method of `HiveMetastoreCatalog` computes
`pathsInMetastore` from the metastore relation's catalog table. This
only returns the table base path, which is not correct for partitioned
tables. As a result, cached lookups on partitioned tables always miss,
  and these relations are always recomputed.







[GitHub] spark pull request: [SPARK-12755][CORE] Stop the event logger befo...

2016-01-25 Thread mallman
Github user mallman commented on the pull request:

https://github.com/apache/spark/pull/10700#issuecomment-174599806
  
Thanks, @srowen.





[GitHub] spark pull request: [SPARK-12755][CORE] Stop the event logger befo...

2016-01-21 Thread mallman
Github user mallman commented on the pull request:

https://github.com/apache/spark/pull/10700#issuecomment-173490509
  
Here are my current thoughts. Josh says this functionality is going to be 
removed in Spark 2.0. The bug this PR is designed to address manifests itself 
in Spark 1.5 in three ways (that I'm aware of):

1. Misleading log messages from the Master (reported above).
2. Incomplete (aka "in progress") application event logs, which can be 
further divided into two scenarios:
2.a. Incomplete uncompressed event log files. The log processor can recover 
these files.
2.b. Incomplete compressed event log files. The compression output is 
truncated and unreadable by normal means. The history server reports a 
corrupted event log. I cannot definitively tie that symptom to this bug, but it 
agrees with my experience.

The most problematic of these is unrecoverable event logs. I've been 
frustrated by this before and turned off event log compression as a workaround. 
Since deploying a build with this patch to one of our dev clusters I haven't 
seen this problem again.

I don't see a simple way to write a test to support this PR.

Overall, I feel we should close this PR but keep a reference to it from 
Jira with a comment that Spark 1.5 and 1.6 users can try this patch—at their 
own risk—to address the described symptoms if they wish to. It's going into 
our own Spark 1.x builds.

I'll close this PR and the associated Jira issue within the next few days 
unless someone objects or wishes to continue discussion.

Thanks.





[GitHub] spark pull request: [SPARK-12755][CORE] Stop the event logger befo...

2016-01-18 Thread mallman
Github user mallman commented on the pull request:

https://github.com/apache/spark/pull/10700#issuecomment-172564021
  
Sorry guys. I bungled the ordering of the `stop()` calls. That's what I get 
for doing a manual patch from a manual diff from another branch-1.5... 
:disappointed: This patch passes all tests on branch-1.3. Can you please kick 
off a new test build in jenkins?





[GitHub] spark pull request: [SPARK-12755][CORE] Stop the event logger befo...

2016-01-17 Thread mallman
Github user mallman commented on the pull request:

https://github.com/apache/spark/pull/10700#issuecomment-172443532
  
I should also state that my original motivation in submitting this patch 
was to address the confusing log messages

Application ... is still in progress, it may be terminated abnormally.

which I saw in the Spark master log for apps that actually terminated
normally.

Also, it's just come to mind that this bug may explain another behavior 
I've seen—sometimes an app's event log is corrupted if it was configured to 
be compressed. If the log is uncompressed then the ability for the history 
reader to decode an "in progress" event log allows it to be processed as 
expected. However, if the event log is being written through a compressed 
output stream and is not properly flushed before it is processed, then the 
processing may fail because the file compression was incomplete. (As this just 
occurred to me I haven't tested this hypothesis, but it does sound like a 
plausible explanation.) If this is the case, then this patch should correct the 
problem with corrupt compressed event logs.





[GitHub] spark pull request: [SPARK-12755][CORE] Stop the event logger befo...

2016-01-17 Thread mallman
Github user mallman commented on the pull request:

https://github.com/apache/spark/pull/10700#issuecomment-172440507
  
Hi Josh,

Good questions. I may have submitted this PR incorrectly. Perhaps you can 
guide me in the right direction.

I submitted this PR for merging into master because my understanding is 
that's how all PRs for the Spark project should be created. And patches
against master may be backported to earlier releases. However, I originally 
created and tested this patch on branch-1.5 because that's what we're currently 
running. So while this patch may be irrelevant to master (or Spark 2.0), it's 
relevant to the Spark 1.5 branch and presumably 1.6 as well. Under these 
circumstances, should I have submitted a PR against master as I have done? The 
code contribution guidelines state that only in a special case would a PR be 
opened against another branch. Does a patch with no or lesser relevance to the 
master branch compared to an earlier release branch qualify as a "special 
case"? And if so, which branch should I have submitted the PR against?

Thanks.





[GitHub] spark pull request: [SPARK-6950][CORE] Stop the event logger befor...

2016-01-11 Thread mallman
Github user mallman commented on the pull request:

https://github.com/apache/spark/pull/10700#issuecomment-170659870
  
Changed Jira ref from SPARK-6950 to SPARK-12755. SPARK-6950 is an older, 
defunct ticket. Oops.





[GitHub] spark pull request: [SPARK-6950][CORE] Stop the event logger befor...

2016-01-11 Thread mallman
GitHub user mallman opened a pull request:

https://github.com/apache/spark/pull/10700

[SPARK-6950][CORE] Stop the event logger before the DAG scheduler

[SPARK-6950][CORE] Stop the event logger before the DAG scheduler to avoid 
a race condition where the standalone master attempts to build the app's 
history UI before the event log is stopped
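
Conceptually, the fix is an ordering change in `SparkContext.stop()` (a rough
sketch, not the actual code; member names are approximate):

    // stop and flush the event logger first, so the log file is complete...
    _eventLogger.foreach(_.stop())
    // ...before the DAG scheduler (and with it the app) shuts down, which is what
    // leads the standalone master to build the app's history UI from that log
    _dagScheduler.stop()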

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/VideoAmp/spark-public stop_event_logger_first

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/10700.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #10700


commit 7085daa95f96bcf65447e4bb5f39d9014a663614
Author: Michael Allman 
Date:   2016-01-11T19:13:01Z

[SPARK-6950][CORE] Stop the event logger before the DAG scheduler to
avoid a race condition where the standalone master attempts to build the
app's history UI before the event log is stopped







[GitHub] spark pull request: [SPARK-8982][Core] Worker hostnames not showin...

2015-12-17 Thread mallman
Github user mallman commented on the pull request:

https://github.com/apache/spark/pull/7345#issuecomment-165559632
  
@andrewor14 We've put this in production. Everything looks good. Hostnames 
show up in the UI as expected. No broken links.





[GitHub] spark pull request: [SPARK-8982][Core] Worker hostnames not showin...

2015-12-17 Thread mallman
Github user mallman commented on the pull request:

https://github.com/apache/spark/pull/7345#issuecomment-165532682
  
To add my two cents, I think that to call this change "cosmetic" is 
strictly true but underrates its value. In our case we have additional 
monitoring systems that report based on local hostname. When we want to 
investigate an issue we see in monitoring in the Spark UI, not having the 
hostnames in the UI makes it more difficult than it was in Spark 1.3. Also, 
because we're in AWS and use amazon's DNS for reverse lookup, we can't do a 
reverse lookup on the IP to see the hostname. (I know that's our problem—just 
illustrating the impact from our perspective.)

With regard to a fix, I'm also in favor of a simple patch to 
`start-slave.sh`. In fact, it's something we're going to try ourselves. I'll 
try to get it done today and will report here on our experience.

Thanks.





[GitHub] spark pull request: [SPARK-9304] [BUILD] Improve backwards compati...

2015-07-24 Thread mallman
Github user mallman commented on the pull request:

https://github.com/apache/spark/pull/7639#issuecomment-124568478
  
Thanks for the fix @srowen. It was my oversight to assume it was safe to 
remove these scripts in the first place.





[GitHub] spark pull request: [SPARK-8401] [Build] Scala version switching b...

2015-07-20 Thread mallman
Github user mallman commented on the pull request:

https://github.com/apache/spark/pull/6832#issuecomment-122957103
  
@srowen I've pushed a new commit to replace usage of 
`dev/change-version-to-*.sh` scripts with `dev/change-scala-version.sh <version>`.
I also modified the latter so it takes a single argument instead of
two. While this makes the current script more tied to version 2.10 and 2.11 
specifically, I believe the simplicity in usage obtained is worth deferring the 
extra complexity of programmatically determining the "from version" when we 
begin support for 2.12.
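
For example (a usage sketch; only the target Scala version is passed):

    # before: one script per target version
    ./dev/change-version-to-2.11.sh

    # after: a single script that takes the target version as its only argument
    ./dev/change-scala-version.sh 2.11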





[GitHub] spark pull request: [SPARK-8401] [Build] Scala version switching b...

2015-07-20 Thread mallman
Github user mallman commented on the pull request:

https://github.com/apache/spark/pull/6832#issuecomment-122946797
  
@srowen I'm working on this now.





[GitHub] spark pull request: [SPARK-8401] [Build] Scala version switching b...

2015-07-16 Thread mallman
Github user mallman commented on the pull request:

https://github.com/apache/spark/pull/6832#issuecomment-122112493
  
@srowen Sorry, I've been swamped. I think I can get this done by Saturday 
if you want to wait.





[GitHub] spark pull request: [SPARK-8401] [Build] Scala version switching b...

2015-07-13 Thread mallman
Github user mallman commented on the pull request:

https://github.com/apache/spark/pull/6832#issuecomment-120988236
  
@srowen @ScrapCodes Let me know if you'd like me to take on those 
additional tasks. Cheers.





[GitHub] spark pull request: [SPARK-8401] [Build] Scala version switching b...

2015-07-09 Thread mallman
Github user mallman commented on the pull request:

https://github.com/apache/spark/pull/6832#issuecomment-120059239
  
I've pushed a commit to implement the second strategy. I've tested this 
script successfully on OS X Yosemite and Ubuntu 14.





[GitHub] spark pull request: [SPARK-8401] [Build] Scala version switching b...

2015-07-08 Thread mallman
Github user mallman commented on the pull request:

https://github.com/apache/spark/pull/6832#issuecomment-119708081
  
Thanks for the tip, @srowen. That works.

I now have a version following approach (2) which I've verified works on OS 
X with its built-in sed. I'll test on a GNU system later this afternoon or 
tomorrow. Assuming that works I'll push a new commit.





[GitHub] spark pull request: [SPARK-8401] [Build] Scala version switching b...

2015-07-07 Thread mallman
Github user mallman commented on the pull request:

https://github.com/apache/spark/pull/6832#issuecomment-119375569
  
The original code replaces only the first instance of 
`2.10<` in the file, which is the desired behavior. The 
code you presented replaces all of them. You can more clearly see how that's a 
problem if you run that sed command to change from version 2.10 to 2.11, then
back again to 2.10, and then run `git diff pom.xml`.
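
A small sketch of the difference (the pattern here is simplified; the real
command targets the Scala version properties in `pom.xml`):

    # GNU sed address range `0,/regexp/`: the substitution only takes effect up to
    # and including the first matching line, i.e. one change per file
    sed -e '0,/2\.10/s//2.11/' pom.xml

    # plain substitution: applies to the first match on every line, so every
    # occurrence in the file (one per line) gets changed
    sed -e 's/2\.10/2.11/' pom.xml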





[GitHub] spark pull request: [SPARK-8401] [Build] Scala version switching b...

2015-07-07 Thread mallman
Github user mallman commented on the pull request:

https://github.com/apache/spark/pull/6832#issuecomment-119333018
  
I've run into a roadblock. This syntax:

sed -e 
'0,/2.102.11

[GitHub] spark pull request: [SPARK-8401] [Build] Scala version switching b...

2015-07-07 Thread mallman
Github user mallman commented on the pull request:

https://github.com/apache/spark/pull/6832#issuecomment-119266285
  
I'll work on a revision following along the lines of what @ScrapCodes did 
and push it to this PR. Incidentally, I was going to suggest we use `mktemp` to 
create the temp files, but then I realized that tool is different on BSD versus
GNU systems.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-8401] [Build] Scala version switching b...

2015-07-06 Thread mallman
Github user mallman commented on the pull request:

https://github.com/apache/spark/pull/6832#issuecomment-118999888
  
I've spent some more time googling around this problem. Unsurprisingly, 
there's plenty of discussion/frustration around finding a cross-platform 
solution. There doesn't seem to be a magic bullet. In addition to the options 
we've discussed, I think there are two more worth considering. One is to use
the `-i` flag but pass a backup file suffix like `sed -i.bak ...`. Then
delete the `.bak` files. This solution is described here: 
http://stackoverflow.com/a/22084103. Another option is to use perl instead of 
sed.

That brings us to four approaches:

1. Look for GNU sed and exit if it can't be found. (what I did)
1. Write sed output to temporary files and move the temporary files to the 
original files. (what @ScrapCodes did)
1. Use the `-i.bak` syntax and delete the `.bak` files. This works on GNU 
and BSD sed, but maybe not XYZ or PDQ sed. The point being that `-i` in any 
form just isn't part of the posix standard.
1. Use `perl -i ...`, which is cross-platform.
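
Rough sketches of (3) and (4), again with a simplified pattern:

    # (3) works with both GNU and BSD sed: write a .bak backup, then delete it
    sed -i.bak -e 's/2\.10/2.11/' pom.xml && rm -f pom.xml.bak

    # (4) perl's in-place edit needs no backup suffix on either platform
    perl -pi -e 's/2\.10/2.11/' pom.xml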

At this point I would lean toward either (2) or (4), with (4) being the 
cleaner (easier) implementation. It also assumes the system has perl. Thoughts?





[GitHub] spark pull request: [SPARK-8401] [Build] Scala version switching b...

2015-07-05 Thread mallman
Github user mallman commented on the pull request:

https://github.com/apache/spark/pull/6832#issuecomment-118743436
  
@srowen I just returned from my vacation abroad and am catching up. Sorry 
for the wait. I'll take a look at this tomorrow. Cheers.





[GitHub] spark pull request: [SPARK-8401] [Build] Scala version switching b...

2015-06-19 Thread mallman
Github user mallman commented on the pull request:

https://github.com/apache/spark/pull/6832#issuecomment-113548582
  
Sure thing. FYI, I'm leaving for Iceland tomorrow (Saturday), and I'll be 
away for two weeks. I will probably be incommunicado during this time. If you 
need something from me, can we put this PR on hold? Or if you find a 
cross-platform `sed` syntax that works, I won't mind if you just push that 
change. Let's keep the jackson and scaladoc fixes either way.

Cheers.





[GitHub] spark pull request: [SPARK-8401] [Build] Scala version switching b...

2015-06-18 Thread mallman
Github user mallman commented on the pull request:

https://github.com/apache/spark/pull/6832#issuecomment-113206728
  
Indeed the build does generate the scaladoc in the right location, but
`docs/_plugins/copy_api_dirs.rb` is currently hardcoded to always look for
the API docs in `target/scala-2.10`:


https://github.com/apache/spark/blob/7af3818c6b2bf35bfa531ab7cc3a4a714385015e/docs/_plugins/copy_api_dirs.rb#L37





[GitHub] spark pull request: [SPARK-8401] [Build] Scala version switching b...

2015-06-18 Thread mallman
Github user mallman commented on a diff in the pull request:

https://github.com/apache/spark/pull/6832#discussion_r32703376
  
--- Diff: dev/change-scala-version.sh ---
@@ -0,0 +1,63 @@
+#!/usr/bin/env bash
+
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+usage() {
+  echo "Usage: $(basename $0)  " 1>&2
+  exit 1
+}
+
+if [ $# -ne 2 ]; then
+  echo "Wrong number of arguments" 1>&2
+  usage
+fi
+
+FROM_VERSION=$1
+TO_VERSION=$2
+
+VALID_VERSIONS=( 2.10 2.11 )
+
+check_scala_version() {
+  for i in ${VALID_VERSIONS[*]}; do [ $i = "$1" ] && return 0; done
+  echo "Invalid Scala version: $1. Valid versions: ${VALID_VERSIONS[*]}" 
1>&2
+  exit 1
+}
+
+check_scala_version "$FROM_VERSION"
+check_scala_version "$TO_VERSION"
+
+test_sed() {
+  [ ! -z "$($1 --version 2>&1 | head -n 1 | grep 'GNU sed')" ]
+}
+
+# Find GNU sed. On OS X with MacPorts you can install gsed with "sudo port install gsed"
+if test_sed sed; then
+  SED=sed
+elif test_sed gsed; then
+  SED=gsed
+else
+  echo "Could not find GNU sed. Tried \"sed\" and \"gsed\"" 1>&2
+  exit 1
+fi
+
+BASEDIR=$(dirname $0)/..
+find $BASEDIR -name 'pom.xml' | grep -v target \
+  | xargs -I {} $SED -i -e 's/\(artifactId.*\)_'$FROM_VERSION'/\1_'$TO_VERSION'/g' {}
+
+# Update source of scaladocs
+$SED -i -e 's/scala\-'$FROM_VERSION'/scala\-'$TO_VERSION'/' $BASEDIR/docs/_plugins/copy_api_dirs.rb
--- End diff --

That makes sense. I also now remember better why I removed that line: it had
to do with some incorrect behavior in changing `scala.binary.version`, which
I mistakenly attributed to that portion of the script. Absent any apparent
problem, I'm going to restore that part of the script.





[GitHub] spark pull request: [SPARK-8401] [Build] Scala version switching b...

2015-06-17 Thread mallman
Github user mallman commented on a diff in the pull request:

https://github.com/apache/spark/pull/6832#discussion_r32650306
  
--- Diff: dev/change-scala-version.sh ---
@@ -0,0 +1,63 @@
+#!/usr/bin/env bash
+
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+usage() {
+  echo "Usage: $(basename $0)  " 1>&2
+  exit 1
+}
+
+if [ $# -ne 2 ]; then
+  echo "Wrong number of arguments" 1>&2
+  usage
+fi
+
+FROM_VERSION=$1
+TO_VERSION=$2
+
+VALID_VERSIONS=( 2.10 2.11 )
+
+check_scala_version() {
+  for i in ${VALID_VERSIONS[*]}; do [ $i = "$1" ] && return 0; done
+  echo "Invalid Scala version: $1. Valid versions: ${VALID_VERSIONS[*]}" 
1>&2
+  exit 1
+}
+
+check_scala_version "$FROM_VERSION"
+check_scala_version "$TO_VERSION"
+
+test_sed() {
+  [ ! -z "$($1 --version 2>&1 | head -n 1 | grep 'GNU sed')" ]
+}
+
+# Find GNU sed. On OS X with MacPorts you can install gsed with "sudo port install gsed"
+if test_sed sed; then
+  SED=sed
+elif test_sed gsed; then
+  SED=gsed
+else
+  echo "Could not find GNU sed. Tried \"sed\" and \"gsed\"" 1>&2
+  exit 1
+fi
+
+BASEDIR=$(dirname $0)/..
+find $BASEDIR -name 'pom.xml' | grep -v target \
+  | xargs -I {} $SED -i -e 's/\(artifactId.*\)_'$FROM_VERSION'/\1_'$TO_VERSION'/g' {}
+
+# Update source of scaladocs
+$SED -i -e 's/scala\-'$FROM_VERSION'/scala\-'$TO_VERSION'/' $BASEDIR/docs/_plugins/copy_api_dirs.rb
--- End diff --

It used to, yes. However, that property is now set by activating the
appropriate Maven profile, `scala-2.10` or `scala-2.11`. Changing it through
a textual find/replace overrides that logic: for example, the `scala-2.10`
profile would end up compiling with Scala 2.11, or vice versa.
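A short sketch of what that means in practice. The profile names are the ones
discussed in this thread; activating a profile with `-P` is generic Maven usage,
so treat the exact flag as illustrative rather than the project's documented one.

```
# scala.binary.version is supplied by the profile, not by editing the parent POM:
./build/mvn -Pscala-2.11 -DskipTests clean package   # compiles against Scala 2.11
# Overwriting <scala.binary.version> with a textual find/replace would defeat this;
# e.g. a build with the scala-2.10 profile could then end up on 2.11 anyway.
```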





[GitHub] spark pull request: [SPARK-8401] [Build] Scala version switching b...

2015-06-17 Thread mallman
Github user mallman commented on a diff in the pull request:

https://github.com/apache/spark/pull/6832#discussion_r32649538
  
--- Diff: dev/change-scala-version.sh ---
@@ -0,0 +1,63 @@
+#!/usr/bin/env bash
+
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+usage() {
+  echo "Usage: $(basename $0)  " 1>&2
+  exit 1
+}
+
+if [ $# -ne 2 ]; then
+  echo "Wrong number of arguments" 1>&2
+  usage
+fi
+
+FROM_VERSION=$1
+TO_VERSION=$2
+
+VALID_VERSIONS=( 2.10 2.11 )
+
+check_scala_version() {
+  for i in ${VALID_VERSIONS[*]}; do [ $i = "$1" ] && return 0; done
+  echo "Invalid Scala version: $1. Valid versions: ${VALID_VERSIONS[*]}" 
1>&2
+  exit 1
+}
+
+check_scala_version "$FROM_VERSION"
+check_scala_version "$TO_VERSION"
+
+test_sed() {
+  [ ! -z "$($1 --version 2>&1 | head -n 1 | grep 'GNU sed')" ]
+}
+
+# Find GNU sed. On OS X with MacPorts you can install gsed with "sudo port install gsed"
+if test_sed sed; then
--- End diff --

I can, but it seems that idiom is used simply to check for the existence of
a binary. We want to test that `sed --version` outputs `GNU sed`; `command -v`
would be inadequate for that purpose.

However, if you still want me to, then I guess

test_sed() {
  [ -n "$(command -v $1)" -a ! -z "$($1 --version 2>&1 | head -n 1 | grep 'GNU sed')" ]
}

is the way to go? Or perhaps `test_gnu_sed` would be a better name for that
function?
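For what it's worth, one possible shape for that combined check under the suggested
name. The quoting and the `&&` split are editorial suggestions, not what the PR as
submitted contains.

```
# Check that the candidate binary exists, then that it identifies itself as GNU sed.
test_gnu_sed() {
  command -v "$1" > /dev/null 2>&1 \
    && [ -n "$("$1" --version 2>&1 | head -n 1 | grep 'GNU sed')" ]
}
```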





[GitHub] spark pull request: [SPARK-8401] [Build] Scala version switching b...

2015-06-16 Thread mallman
Github user mallman commented on the pull request:

https://github.com/apache/spark/pull/6832#issuecomment-112608540
  
@srowen I created the Jira ticket, which describes the problem with the
current version-changing scripts.





[GitHub] spark pull request: Scala version switching build enhancements

2015-06-16 Thread mallman
Github user mallman commented on the pull request:

https://github.com/apache/spark/pull/6832#issuecomment-112478712
  
@srowen Should I create one Jira ticket for this or multiple?





[GitHub] spark pull request: Scala version switching build enhancements

2015-06-16 Thread mallman
Github user mallman commented on the pull request:

https://github.com/apache/spark/pull/6832#issuecomment-112477102
  
@srowen Will create a Jira ticket.

@ScrapCodes This is what I get with (presumably BSD) sed on OS X:

```
[msa@Michaels-MacBook-Pro spark-1.4]$ ./dev/change-version-to-2.11.sh 
[msa@Michaels-MacBook-Pro spark-1.4]$ gst
On branch scala-versions
Your branch and 'vamp/scala-versions' have diverged,
and have 7 and 4 different commits each, respectively.
  (use "git pull" to merge the remote branch into yours)
Changes not staged for commit:
  (use "git add ..." to update what will be committed)
  (use "git checkout -- ..." to discard changes in working directory)

modified:   assembly/pom.xml
modified:   bagel/pom.xml
modified:   core/pom.xml
modified:   dev/change-scala-version.sh
modified:   docs/_plugins/copy_api_dirs.rb
modified:   examples/pom.xml
modified:   external/flume-sink/pom.xml
modified:   external/flume/pom.xml
modified:   external/kafka-assembly/pom.xml
modified:   external/kafka/pom.xml
modified:   external/mqtt/pom.xml
modified:   external/twitter/pom.xml
modified:   external/zeromq/pom.xml
modified:   extras/java8-tests/pom.xml
modified:   extras/kinesis-asl/pom.xml
modified:   extras/spark-ganglia-lgpl/pom.xml
modified:   graphx/pom.xml
modified:   launcher/pom.xml
modified:   mllib/pom.xml
modified:   network/common/pom.xml
modified:   network/shuffle/pom.xml
modified:   network/yarn/pom.xml
modified:   pom.xml
modified:   repl/pom.xml
modified:   sql/catalyst/pom.xml
modified:   sql/core/pom.xml
modified:   sql/hive-thriftserver/pom.xml
modified:   sql/hive/pom.xml
modified:   streaming/pom.xml
modified:   tools/pom.xml
modified:   unsafe/pom.xml
modified:   yarn/pom.xml

Untracked files:
  (use "git add ..." to include in what will be committed)

assembly/pom.xml-e
bagel/pom.xml-e
core/pom.xml-e
dev/audit-release/blank_maven_build/pom.xml-e
dev/audit-release/maven_app_core/pom.xml-e
docs/_plugins/copy_api_dirs.rb-e
examples/pom.xml-e
external/flume-sink/pom.xml-e
external/flume/pom.xml-e
external/kafka-assembly/pom.xml-e
external/kafka/pom.xml-e
external/mqtt/pom.xml-e
external/twitter/pom.xml-e
external/zeromq/pom.xml-e
extras/java8-tests/pom.xml-e
extras/kinesis-asl/pom.xml-e
extras/spark-ganglia-lgpl/pom.xml-e
graphx/pom.xml-e
launcher/pom.xml-e
mllib/pom.xml-e
network/common/pom.xml-e
network/shuffle/pom.xml-e
network/yarn/pom.xml-e
pom.xml-e
repl/pom.xml-e
sql/catalyst/pom.xml-e
sql/core/pom.xml-e
sql/hive-thriftserver/pom.xml-e
sql/hive/pom.xml-e
streaming/pom.xml-e
tools/pom.xml-e
unsafe/pom.xml-e
yarn/pom.xml-e

no changes added to commit (use "git add" and/or "git commit -a")
```
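For context on where those `*-e` files come from: BSD/macOS sed parses `-i` as taking
a backup-suffix argument, so the `-e` that follows it is consumed as that suffix. The
substitution below is illustrative, not the script's actual expression.

```
# GNU sed: edits in place with no backup.
# BSD/macOS sed: edits in place and also writes a pom.xml-e backup, because "-e"
# is swallowed as the -i suffix instead of being read as an expression flag.
sed -i -e 's/_2\.10/_2.11/g' pom.xml
```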





[GitHub] spark pull request: Scala version switching build enhancements

2015-06-15 Thread mallman
GitHub user mallman opened a pull request:

https://github.com/apache/spark/pull/6832

Scala version switching build enhancements

These commits address a few minor issues in the Scala cross-version support 
in the build:

  1. Correct two missing `${scala.binary.version}` pom file substitutions.
  2. Don't update `scala.binary.version` in parent POM. This property is 
set through profiles.
  3. Update the source of the generated scaladocs in 
`docs/_plugins/copy_api_dirs.rb`.
  4. Factor common code out of `dev/change-version-to-*.sh` and add some 
validation. We also test `sed` to see if it's GNU sed and try `gsed` as an 
alternative if not. This prevents the script from running with a non-GNU sed.

This is my first contribution to Spark. After reading the contributor 
guidelines I wasn't sure if this qualified for a Jira ticket, or if I should 
split these commits into separate contributions. I'm happy to amend my pull 
request or file a Jira ticket as requested.
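For reference, a usage sketch of the factored-out script described in item 4, with the
arguments and validation message as defined in `dev/change-scala-version.sh` in this PR:

```
./dev/change-scala-version.sh 2.10 2.11   # switch the build to Scala 2.11
./dev/change-scala-version.sh 2.11 2.10   # and back again
./dev/change-scala-version.sh 2.9 2.11    # fails: "Invalid Scala version: 2.9. Valid versions: 2.10 2.11"
```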

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/VideoAmp/spark-1 scala-versions

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/6832.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #6832


commit 475088e7423b59dd7a9ef4b0ab426ebcc3e555ba
Author: Michael Allman 
Date:   2015-06-14T19:17:56Z

Replace jackson-module-scala_2.10 with
jackson-module-scala_${scala.binary.version}

commit e5a675e3411bdfa3e6e387f6688f330fb8e5067b
Author: Michael Allman 
Date:   2015-06-14T19:20:36Z

Don't update scala.binary.version in parent POM. This property is set
through profiles

commit a896dfdff26e56f8c6f0a4769b1d576e3494a524
Author: Michael Allman 
Date:   2015-06-14T19:34:57Z

Update source of scaladocs when changing Scala version

commit 2b926205ff863c1132f3e1d923ff4715ee566bbc
Author: Michael Allman 
Date:   2015-06-14T19:36:10Z

Factor change-scala-version.sh out of change-version-to-*.sh, adding
command line argument validation and testing for GNU sed







[GitHub] spark pull request: [SPARK-1442][SQL][WIP] Initial window function...

2015-03-19 Thread mallman
Github user mallman commented on the pull request:

https://github.com/apache/spark/pull/3703#issuecomment-83691288
  
I'm confused. Why was this PR abruptly closed? Was there another active PR 
for window functions?





[GitHub] spark pull request: [SPARK-2087] [SQL] [WIP] Multiple thriftserver...

2015-02-05 Thread mallman
Github user mallman commented on the pull request:

https://github.com/apache/spark/pull/4382#issuecomment-73081758
  
FWIW I'd like to add my two cents. The main piece of functionality the 
installation at my company would benefit from is independent user sessions. I'm 
not familiar enough with the source to say exactly what that means in terms of 
a source patch, but one of the key use cases is the ability to set the session 
default database ("use <database>") and SQLConf settings independently of other 
beeline connections. Right now, setting the database sets it across all 
connections and that is a major impediment to wider use of a shared 
thriftserver.

Cheers!





[GitHub] spark pull request: [SPARK-1266] persist factors in implicit ALS

2014-03-18 Thread mallman
Github user mallman commented on a diff in the pull request:

https://github.com/apache/spark/pull/165#discussion_r10714229
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/recommendation/ALS.scala ---
@@ -187,21 +189,39 @@ class ALS private (
   }
 }
 
-for (iter <- 1 to iterations) {
-  // perform ALS update
-  logInfo("Re-computing I given U (Iteration %d/%d)".format(iter, 
iterations))
-  // YtY / XtX is an Option[DoubleMatrix] and is only required for the 
implicit feedback model
-  val YtY = computeYtY(users)
-  val YtYb = ratings.context.broadcast(YtY)
-  products = updateFeatures(users, userOutLinks, productInLinks, 
partitioner, rank, lambda,
-alpha, YtYb)
-  logInfo("Re-computing U given I (Iteration %d/%d)".format(iter, 
iterations))
-  val XtX = computeYtY(products)
-  val XtXb = ratings.context.broadcast(XtX)
-  users = updateFeatures(products, productOutLinks, userInLinks, 
partitioner, rank, lambda,
-alpha, XtXb)
+if (implicitPrefs) {
+  for (iter <- 1 to iterations) {
+// perform ALS update
+logInfo("Re-computing I given U (Iteration %d/%d)".format(iter, 
iterations))
+// Persist users because it will be called twice.
+users.persist()
+val YtY = Some(sc.broadcast(computeYtY(users)))
+val previousProducts = products
+products = updateFeatures(users, userOutLinks, productInLinks, 
partitioner, rank, lambda,
+  alpha, YtY)
+previousProducts.unpersist()
+logInfo("Re-computing U given I (Iteration %d/%d)".format(iter, 
iterations))
+products.persist()
+val XtX = Some(sc.broadcast(computeYtY(products)))
+val previousUsers = users
+users = updateFeatures(products, productOutLinks, userInLinks, 
partitioner, rank, lambda,
+  alpha, XtX)
+previousUsers.unpersist()
+  }
+} else {
+  for (iter <- 1 to iterations) {
+// perform ALS update
+logInfo("Re-computing I given U (Iteration %d/%d)".format(iter, 
iterations))
+products = updateFeatures(users, userOutLinks, productInLinks, 
partitioner, rank, lambda,
+  alpha, YtY = None)
+logInfo("Re-computing U given I (Iteration %d/%d)".format(iter, 
iterations))
+users = updateFeatures(products, productOutLinks, userInLinks, 
partitioner, rank, lambda,
+  alpha, YtY = None)
+  }
 }
 
+products.persist()
--- End diff --

Why is this here?



