spark git commit: [SPARK-12030] Fix Platform.copyMemory to handle overlapping regions.

2015-12-01 Thread yhuai
Repository: spark Updated Branches: refs/heads/branch-1.6 ab2a124c8 -> 1cf9d3858 [SPARK-12030] Fix Platform.copyMemory to handle overlapping regions. This bug was exposed as memory corruption in Timsort, which uses `copyMemory` to copy large regions that can overlap. The prior implementation

spark git commit: [SPARK-12030] Fix Platform.copyMemory to handle overlapping regions.

2015-12-01 Thread yhuai
Repository: spark Updated Branches: refs/heads/master 34e7093c1 -> 2cef1cdfb [SPARK-12030] Fix Platform.copyMemory to handle overlapping regions. This bug was exposed as memory corruption in Timsort, which uses `copyMemory` to copy large regions that can overlap. The prior implementation did
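
The gist of the fix is memmove-style direction handling. A minimal sketch of the idea, assuming the copy goes through `sun.misc.Unsafe` in bounded blocks; the block size and helper names here are illustrative, not Spark's actual `Platform` code:

```scala
import sun.misc.Unsafe

object OverlapSafeCopy {
  private val unsafe: Unsafe = {
    val f = classOf[Unsafe].getDeclaredField("theUnsafe")
    f.setAccessible(true)
    f.get(null).asInstanceOf[Unsafe]
  }
  private val BlockSize = 1L << 20 // copy in bounded chunks

  def copyMemory(src: AnyRef, srcOff: Long, dst: AnyRef, dstOff: Long, length: Long): Unit = {
    var remaining = length
    if (dstOff < srcOff) {
      // Forward copy: the write cursor trails the read cursor, so overlapping
      // source bytes are read before they are overwritten.
      var s = srcOff; var d = dstOff
      while (remaining > 0) {
        val n = math.min(remaining, BlockSize)
        unsafe.copyMemory(src, s, dst, d, n)
        s += n; d += n; remaining -= n
      }
    } else {
      // Backward copy: start from the end so the symmetric overlap case is safe.
      var s = srcOff + length; var d = dstOff + length
      while (remaining > 0) {
        val n = math.min(remaining, BlockSize)
        s -= n; d -= n; remaining -= n
        unsafe.copyMemory(src, s, dst, d, n)
      }
    }
  }
}
```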

spark git commit: [SPARK-12004] Preserve the RDD partitioner through RDD checkpointing

2015-12-01 Thread andrewor14
Repository: spark Updated Branches: refs/heads/branch-1.6 1cf9d3858 -> 81db8d086 [SPARK-12004] Preserve the RDD partitioner through RDD checkpointing. The solution is to save the RDD partitioner in a separate file in the RDD checkpoint directory. That is, `/_partitioner`. In most cases,

spark git commit: [SPARK-12004] Preserve the RDD partitioner through RDD checkpointing

2015-12-01 Thread andrewor14
Repository: spark Updated Branches: refs/heads/master 2cef1cdfb -> 60b541ee1 [SPARK-12004] Preserve the RDD partitioner through RDD checkpointing. The solution is to save the RDD partitioner in a separate file in the RDD checkpoint directory. That is, `/_partitioner`. In most cases,
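
A hedged sketch of the sidecar-file idea described above; the Java serialization, helper names, and missing-file fallback are illustrative, not the exact checkpointing code:

```scala
import java.io.{ObjectInputStream, ObjectOutputStream}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.spark.Partitioner

def writePartitionerFile(checkpointDir: Path, p: Partitioner, conf: Configuration): Unit = {
  val fs  = checkpointDir.getFileSystem(conf)
  val out = new ObjectOutputStream(fs.create(new Path(checkpointDir, "_partitioner")))
  try out.writeObject(p) finally out.close()
}

def readPartitionerFile(checkpointDir: Path, conf: Configuration): Option[Partitioner] = {
  val fs   = checkpointDir.getFileSystem(conf)
  val file = new Path(checkpointDir, "_partitioner")
  if (!fs.exists(file)) None // no file: the recovered RDD simply has no partitioner
  else {
    val in = new ObjectInputStream(fs.open(file))
    try Some(in.readObject().asInstanceOf[Partitioner]) finally in.close()
  }
}
```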

spark git commit: [SPARK-11961][DOC] Add docs of ChiSqSelector

2015-12-01 Thread jkbradley
Repository: spark Updated Branches: refs/heads/branch-1.6 21909b8ac -> 5647774b0 [SPARK-11961][DOC] Add docs of ChiSqSelector https://issues.apache.org/jira/browse/SPARK-11961 Author: Xusen Yin Closes #9965 from yinxusen/SPARK-11961. (cherry picked from commit

spark git commit: [SPARK-11961][DOC] Add docs of ChiSqSelector

2015-12-01 Thread jkbradley
Repository: spark Updated Branches: refs/heads/master 328b757d5 -> e76431f88 [SPARK-11961][DOC] Add docs of ChiSqSelector https://issues.apache.org/jira/browse/SPARK-11961 Author: Xusen Yin Closes #9965 from yinxusen/SPARK-11961.

spark git commit: [SPARK-11788][SQL] surround timestamp/date value with quotes in JDBC data source

2015-12-01 Thread yhuai
Repository: spark Updated Branches: refs/heads/master 47a0abc34 -> 5a8b5fdd6 [SPARK-11788][SQL] surround timestamp/date value with quotes in JDBC data source. When querying the Timestamp or Date column, like the following: val filtered = jdbcdf.where($"TIMESTAMP_COLUMN" >= beg &&
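
The underlying issue is how a pushed-down filter value is rendered into the generated WHERE clause. A minimal sketch of the quoting rule; `compileValue` is a simplified stand-in for the JDBC data source's actual filter compiler:

```scala
import java.sql.{Date, Timestamp}

// Render a filter value as a SQL literal for the generated WHERE clause.
def compileValue(value: Any): Any = value match {
  case s: String    => s"'${s.replace("'", "''")}'" // escape embedded quotes
  case t: Timestamp => s"'$t'" // e.g. ts >= '2015-12-01 00:00:00.0', not a bare token
  case d: Date      => s"'$d'"
  case other        => other   // numbers etc. stay unquoted
}
```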

spark git commit: [SPARK-11328][SQL] Improve error message when hitting this issue

2015-12-01 Thread yhuai
Repository: spark Updated Branches: refs/heads/master ef6790fdc -> 47a0abc34 [SPARK-11328][SQL] Improve error message when hitting this issue. The issue is that the output committer is not idempotent, and retry attempts will fail because the output file already exists. It is not safe to clean up

spark git commit: [SPARK-11352][SQL] Escape */ in the generated comments.

2015-12-01 Thread yhuai
Repository: spark Updated Branches: refs/heads/master 5a8b5fdd6 -> 5872a9d89 [SPARK-11352][SQL] Escape */ in the generated comments. https://issues.apache.org/jira/browse/SPARK-11352 Author: Yin Huai Closes #10072 from yhuai/SPARK-11352.
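
The hazard: generated Java source embeds expression text inside `/* ... */` comments, so a literal `*/` in that text would terminate the comment early and corrupt the generated code. A sketch of the escaping idea (the helper name is illustrative):

```scala
// Neutralize the closing delimiter before embedding text in a block comment.
def toCommentSafeString(s: String): String = s.replace("*/", "*\\/")

val expr    = "input[0, (a */ b)]"
val comment = s"/* ${toCommentSafeString(expr)} */" // yields /* input[0, (a *\/ b)] */
```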

spark git commit: [SPARK-11328][SQL] Improve error message when hitting this issue

2015-12-01 Thread yhuai
Repository: spark Updated Branches: refs/heads/branch-1.6 d77bf0bd9 -> f1122dd2b [SPARK-11328][SQL] Improve error message when hitting this issue. The issue is that the output committer is not idempotent, and retry attempts will fail because the output file already exists. It is not safe to

spark git commit: [SPARK-12075][SQL] Speed up HiveComparisionTest by avoiding / speeding up TestHive.reset()

2015-12-01 Thread rxin
Repository: spark Updated Branches: refs/heads/master f292018f8 -> ef6790fdc [SPARK-12075][SQL] Speed up HiveComparisionTest by avoiding / speeding up TestHive.reset(). When profiling HiveCompatibilitySuite, I noticed that most of the time seems to be spent in expensive `TestHive.reset()`

spark git commit: [SPARK-12075][SQL] Speed up HiveComparisionTest by avoiding / speeding up TestHive.reset()

2015-12-01 Thread rxin
Repository: spark Updated Branches: refs/heads/branch-1.6 012de2ce5 -> d77bf0bd9 [SPARK-12075][SQL] Speed up HiveComparisionTest by avoiding / speeding up TestHive.reset(). When profiling HiveCompatibilitySuite, I noticed that most of the time seems to be spent in expensive

spark git commit: [SPARK-11328][SQL] Improve error message when hitting this issue

2015-12-01 Thread yhuai
Repository: spark Updated Branches: refs/heads/branch-1.5 80dac0b07 -> f28399e1a [SPARK-11328][SQL] Improve error message when hitting this issue. The issue is that the output committer is not idempotent, and retry attempts will fail because the output file already exists. It is not safe to

spark git commit: Revert "[SPARK-12060][CORE] Avoid memory copy in JavaSerializerInstance.serialize"

2015-12-01 Thread zsxwing
Repository: spark Updated Branches: refs/heads/branch-1.6 81db8d086 -> 21909b8ac Revert "[SPARK-12060][CORE] Avoid memory copy in JavaSerializerInstance.serialize". This reverts commit 9b99b2b46c452ba396e922db5fc7eec02c45b158.

spark git commit: [SPARK-12030] Fix Platform.copyMemory to handle overlapping regions.

2015-12-01 Thread yhuai
Repository: spark Updated Branches: refs/heads/branch-1.5 fc3fb8463 -> 7460e4309 [SPARK-12030] Fix Platform.copyMemory to handle overlapping regions. This bug was exposed as memory corruption in Timsort, which uses `copyMemory` to copy large regions that can overlap. The prior implementation

spark git commit: Revert "[SPARK-12060][CORE] Avoid memory copy in JavaSerializerInstance.serialize"

2015-12-01 Thread zsxwing
Repository: spark Updated Branches: refs/heads/master 60b541ee1 -> 328b757d5 Revert "[SPARK-12060][CORE] Avoid memory copy in JavaSerializerInstance.serialize". This reverts commit 1401166576c7018c5f9c31e0a6703d5fb16ea339.

spark git commit: [SPARK-12002][STREAMING][PYSPARK] Fix python direct stream checkpoint recovery issue

2015-12-01 Thread zsxwing
Repository: spark Updated Branches: refs/heads/branch-1.6 5647774b0 -> 012de2ce5 [SPARK-12002][STREAMING][PYSPARK] Fix python direct stream checkpoint recovery issue. Fixed a minor race condition in #10017. Closes #10017 Author: jerryshao Author: Shixiong Zhu

spark git commit: [SPARK-12002][STREAMING][PYSPARK] Fix python direct stream checkpoint recovery issue

2015-12-01 Thread zsxwing
Repository: spark Updated Branches: refs/heads/master e76431f88 -> f292018f8 [SPARK-12002][STREAMING][PYSPARK] Fix python direct stream checkpoint recovery issue. Fixed a minor race condition in #10017. Closes #10017 Author: jerryshao Author: Shixiong Zhu

spark git commit: [SPARK-11352][SQL] Escape */ in the generated comments.

2015-12-01 Thread yhuai
Repository: spark Updated Branches: refs/heads/branch-1.6 1135430a0 -> 14eadf921 [SPARK-11352][SQL] Escape */ in the generated comments. https://issues.apache.org/jira/browse/SPARK-11352 Author: Yin Huai Closes #10072 from yhuai/SPARK-11352. (cherry picked from

spark git commit: [SPARK-11788][SQL] surround timestamp/date value with quotes in JDBC data source

2015-12-01 Thread yhuai
Repository: spark Updated Branches: refs/heads/branch-1.6 f1122dd2b -> 1135430a0 [SPARK-11788][SQL] surround timestamp/date value with quotes in JDBC data source. When querying the Timestamp or Date column, like the following: val filtered = jdbcdf.where($"TIMESTAMP_COLUMN" >= beg &&

spark git commit: [SPARK-11788][SQL] surround timestamp/date value with quotes in JDBC data source

2015-12-01 Thread yhuai
Repository: spark Updated Branches: refs/heads/branch-1.5 f28399e1a -> fc3fb8463 [SPARK-11788][SQL] surround timestamp/date value with quotes in JDBC data source. When querying the Timestamp or Date column, like the following: val filtered = jdbcdf.where($"TIMESTAMP_COLUMN" >= beg &&

spark git commit: [SPARK-11596][SQL] In TreeNode's argString, if a TreeNode is not a child of the current TreeNode, we should only return the simpleString.

2015-12-01 Thread marmbrus
Repository: spark Updated Branches: refs/heads/master 5872a9d89 -> e96a70d5a [SPARK-11596][SQL] In TreeNode's argString, if a TreeNode is not a child of the current TreeNode, we should only return the simpleString. In TreeNode's argString, if a TreeNode is not a child of the current
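
The rule in one place, as a sketch; `formatArg` is a hypothetical helper, not Catalyst's actual code:

```scala
import org.apache.spark.sql.catalyst.trees.TreeNode

// Recurse fully only into actual children; any other TreeNode that appears as
// an argument is printed as its one-line simpleString.
def formatArg(arg: Any, children: Set[TreeNode[_]]): String = arg match {
  case t: TreeNode[_] if children.contains(t) => t.toString      // full recursive tree
  case t: TreeNode[_]                         => t.simpleString  // one-line summary only
  case other                                  => other.toString
}
```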

spark git commit: [SPARK-11596][SQL] In TreeNode's argString, if a TreeNode is not a child of the current TreeNode, we should only return the simpleString.

2015-12-01 Thread marmbrus
Repository: spark Updated Branches: refs/heads/branch-1.6 14eadf921 -> 1b3db967e [SPARK-11596][SQL] In TreeNode's argString, if a TreeNode is not a child of the current TreeNode, we should only return the simpleString. In TreeNode's argString, if a TreeNode is not a child of the current

spark git commit: [SPARK-12060][CORE] Avoid memory copy in JavaSerializerInstance.serialize

2015-12-01 Thread zsxwing
Repository: spark Updated Branches: refs/heads/branch-1.6 add4e6311 -> 9b99b2b46 [SPARK-12060][CORE] Avoid memory copy in JavaSerializerInstance.serialize. `JavaSerializerInstance.serialize` uses `ByteArrayOutputStream.toByteArray` to get the serialized data.

spark git commit: [SPARK-12060][CORE] Avoid memory copy in JavaSerializerInstance.serialize

2015-12-01 Thread zsxwing
Repository: spark Updated Branches: refs/heads/master c87531b76 -> 140116657 [SPARK-12060][CORE] Avoid memory copy in JavaSerializerInstance.serialize. `JavaSerializerInstance.serialize` uses `ByteArrayOutputStream.toByteArray` to get the serialized data. `ByteArrayOutputStream.toByteArray`
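
The copy being avoided is the one inside `toByteArray`, which allocates and fills a fresh array. A sketch of the workaround, assuming the fix exposes the stream's internal buffer directly (`buf` and `count` are `ByteArrayOutputStream`'s protected members):

```scala
import java.io.ByteArrayOutputStream
import java.nio.ByteBuffer

class ByteBufferOutputStream extends ByteArrayOutputStream {
  // Wrap the internal buffer without copying; the valid bytes are [0, count).
  // The caller must not write to the stream while the ByteBuffer is in use.
  def toByteBuffer: ByteBuffer = ByteBuffer.wrap(buf, 0, count)
}
```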

spark git commit: [SPARK-12046][DOC] Fixes various ScalaDoc/JavaDoc issues

2015-12-01 Thread marmbrus
Repository: spark Updated Branches: refs/heads/master 140116657 -> 69dbe6b40 [SPARK-12046][DOC] Fixes various ScalaDoc/JavaDoc issues. This PR backports PR #10039 to master. Author: Cheng Lian Closes #10063 from liancheng/spark-12046.doc-fix.master.

spark git commit: [SPARK-12068][SQL] use a single column in Dataset.groupBy and count will fail

2015-12-01 Thread marmbrus
Repository: spark Updated Branches: refs/heads/master 69dbe6b40 -> 8ddc55f1d [SPARK-12068][SQL] use a single column in Dataset.groupBy and count will fail. The reason is that, for a single-column `RowEncoder` (or a single-field product encoder), when we use it as the encoder for the grouping key,

spark git commit: [SPARK-12068][SQL] use a single column in Dataset.groupBy and count will fail

2015-12-01 Thread marmbrus
Repository: spark Updated Branches: refs/heads/branch-1.6 9b99b2b46 -> 6e3e3c648 [SPARK-12068][SQL] use a single column in Dataset.groupBy and count will fail. The reason is that, for a single-column `RowEncoder` (or a single-field product encoder), when we use it as the encoder for grouping

spark git commit: [SPARK-11856][SQL] add type cast if the real type is different but compatible with encoder schema

2015-12-01 Thread marmbrus
Repository: spark Updated Branches: refs/heads/branch-1.6 6e3e3c648 -> 74a230676 [SPARK-11856][SQL] add type cast if the real type is different but compatible with encoder schema. When we build the `fromRowExpression` for an encoder, we set up a lot of "unresolved" stuff and lose the

spark git commit: [SPARK-12090] [PYSPARK] consider shuffle in coalesce()

2015-12-01 Thread davies
Repository: spark Updated Branches: refs/heads/master 0f37d1d7e -> 4375eb3f4 [SPARK-12090] [PYSPARK] consider shuffle in coalesce() Author: Davies Liu Closes #10090 from davies/fix_coalesce.
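
For context, the Scala RDD API makes the two coalesce modes explicit; this is an illustration of the semantics involved, not the PySpark patch itself:

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc  = new SparkContext(new SparkConf().setAppName("coalesce-sketch").setMaster("local[*]"))
val rdd = sc.parallelize(1 to 1000, 100)

val merged   = rdd.coalesce(4)                  // no shuffle: narrow dependency, may skew
val balanced = rdd.coalesce(4, shuffle = true)  // full shuffle: data rebalanced evenly
```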

spark git commit: [SPARK-12090] [PYSPARK] consider shuffle in coalesce()

2015-12-01 Thread davies
Repository: spark Updated Branches: refs/heads/branch-1.5 0d57a4ae1 -> ed7264ba2 [SPARK-12090] [PYSPARK] consider shuffle in coalesce() Author: Davies Liu Closes #10090 from davies/fix_coalesce. (cherry picked from commit 4375eb3f48fc7ae90caf6c21a0d3ab0b66bf4efa)

spark git commit: [SPARK-12090] [PYSPARK] consider shuffle in coalesce()

2015-12-01 Thread davies
Repository: spark Updated Branches: refs/heads/branch-1.6 3c4938e26 -> c47a7373a [SPARK-12090] [PYSPARK] consider shuffle in coalesce() Author: Davies Liu Closes #10090 from davies/fix_coalesce. (cherry picked from commit 4375eb3f48fc7ae90caf6c21a0d3ab0b66bf4efa)

spark git commit: [SPARK-12081] Make unified memory manager work with small heaps

2015-12-01 Thread andrewor14
Repository: spark Updated Branches: refs/heads/branch-1.6 72da2a21f -> 84c44b500 [SPARK-12081] Make unified memory manager work with small heaps. The existing `spark.memory.fraction` (default 0.75) gives the system 25% of the space to work with. For small heaps, this is not enough: e.g.
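
The arithmetic behind the concern; the numbers are illustrative and ignore Spark's reserved-memory constant:

```scala
val heapBytes      = 1L << 30  // e.g. a 1 GB executor heap
val memoryFraction = 0.75      // spark.memory.fraction default at the time
val sparkManaged   = (heapBytes * memoryFraction).toLong // ~768 MB for execution + storage
val everythingElse = heapBytes - sparkManaged            // ~256 MB for user code and JVM overhead
```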

spark git commit: [SPARK-8414] Ensure context cleaner periodic cleanups

2015-12-01 Thread joshrosen
Repository: spark Updated Branches: refs/heads/branch-1.6 1b3db967e -> 72da2a21f [SPARK-8414] Ensure context cleaner periodic cleanups. Garbage collection triggers cleanups. If the driver JVM is huge and there is little memory pressure, we may never clean up shuffle files on executors. This
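
A minimal sketch of the periodic trigger, assuming the cleaner is driven by weak references that only fire on GC; the interval and scheduler setup are illustrative, not Spark's actual configuration plumbing:

```scala
import java.util.concurrent.{Executors, TimeUnit}

val periodicGCIntervalSec = 30L * 60 // e.g. every 30 minutes
val gcScheduler = Executors.newSingleThreadScheduledExecutor()
gcScheduler.scheduleAtFixedRate(
  new Runnable { override def run(): Unit = System.gc() },
  periodicGCIntervalSec, periodicGCIntervalSec, TimeUnit.SECONDS)
```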

spark git commit: [SPARK-11352][SQL][BRANCH-1.5] Escape */ in the generated comments.

2015-12-01 Thread yhuai
Repository: spark Updated Branches: refs/heads/branch-1.5 7460e4309 -> 4f07a590c [SPARK-11352][SQL][BRANCH-1.5] Escape */ in the generated comments. https://issues.apache.org/jira/browse/SPARK-11352 This one backports https://github.com/apache/spark/pull/10072 to branch 1.5. Author: Yin

spark git commit: [SPARK-12077][SQL] change the default plan for single distinct

2015-12-01 Thread yhuai
Repository: spark Updated Branches: refs/heads/branch-1.6 84c44b500 -> a5743affc [SPARK-12077][SQL] change the default plan for single distinct. We try to match the behavior of single distinct aggregation in Spark 1.5, but that's not scalable; we should be robust by default and have a flag

spark git commit: [SPARK-12077][SQL] change the default plan for single distinct

2015-12-01 Thread yhuai
Repository: spark Updated Branches: refs/heads/master d96f8c997 -> 96691feae [SPARK-12077][SQL] change the default plan for single distinct. We try to match the behavior of single distinct aggregation in Spark 1.5, but that's not scalable; we should be robust by default and have a flag to

spark git commit: [SPARK-12087][STREAMING] Create new JobConf for every batch in saveAsHadoopFiles

2015-12-01 Thread zsxwing
Repository: spark Updated Branches: refs/heads/master 96691feae -> 8a75a3049 [SPARK-12087][STREAMING] Create new JobConf for every batch in saveAsHadoopFiles. The JobConf object created in `DStream.saveAsHadoopFiles` is used concurrently in multiple places: * The JobConf is updated by
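
A sketch of the isolation idea: capture one base configuration when the output operation is set up, then build a fresh `JobConf` from it for every batch so concurrent mutation cannot race. Names are illustrative:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.mapred.JobConf

// Captured once when saveAsHadoopFiles is wired up.
val baseConf = new Configuration()

// Called once per batch: every batch job gets its own mutable copy.
def jobConfForBatch(): JobConf = new JobConf(baseConf)
```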

spark git commit: [SPARK-12087][STREAMING] Create new JobConf for every batch in saveAsHadoopFiles

2015-12-01 Thread zsxwing
Repository: spark Updated Branches: refs/heads/branch-1.5 4f07a590c -> 0d57a4ae1 [SPARK-12087][STREAMING] Create new JobConf for every batch in saveAsHadoopFiles. The JobConf object created in `DStream.saveAsHadoopFiles` is used concurrently in multiple places: * The JobConf is updated by

spark git commit: [SPARK-12087][STREAMING] Create new JobConf for every batch in saveAsHadoopFiles

2015-12-01 Thread zsxwing
Repository: spark Updated Branches: refs/heads/branch-1.6 a5743affc -> 1f42295b5 [SPARK-12087][STREAMING] Create new JobConf for every batch in saveAsHadoopFiles. The JobConf object created in `DStream.saveAsHadoopFiles` is used concurrently in multiple places: * The JobConf is updated by

spark git commit: [SPARK-11949][SQL] Check bitmasks to set nullable property

2015-12-01 Thread yhuai
Repository: spark Updated Branches: refs/heads/branch-1.6 1f42295b5 -> 3c4938e26 [SPARK-11949][SQL] Check bitmasks to set nullable property. Following up on #10038. We can use bitmasks to determine which grouping expressions need to be set as nullable. cc yhuai Author: Liang-Chi Hsieh

spark git commit: [SPARK-12087][STREAMING] Create new JobConf for every batch in saveAsHadoopFiles

2015-12-01 Thread zsxwing
Repository: spark Updated Branches: refs/heads/branch-1.4 f5af299ab -> b6ba2dab2 [SPARK-12087][STREAMING] Create new JobConf for every batch in saveAsHadoopFiles. The JobConf object created in `DStream.saveAsHadoopFiles` is used concurrently in multiple places: * The JobConf is updated by

spark git commit: [SPARK-11949][SQL] Check bitmasks to set nullable property

2015-12-01 Thread yhuai
Repository: spark Updated Branches: refs/heads/master 8a75a3049 -> 0f37d1d7e [SPARK-11949][SQL] Check bitmasks to set nullable property. Following up on #10038. We can use bitmasks to determine which grouping expressions need to be set as nullable. cc yhuai Author: Liang-Chi Hsieh
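
A sketch of the bitmask test; the bit layout is an assumption for illustration. In GROUPING SETS, an expression's output must be nullable if any grouping set omits it, because its slot is filled with null there:

```scala
def nullableByBitmask(bitmasks: Seq[Int], exprIndex: Int, numExprs: Int): Boolean = {
  // Assume a set bit means "this expression is absent from the grouping set",
  // counting positions from the most significant grouping bit.
  val bit = 1 << (numExprs - 1 - exprIndex)
  bitmasks.exists(mask => (mask & bit) != 0)
}
```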

spark git commit: [SPARK-11954][SQL] Encoder for JavaBeans

2015-12-01 Thread marmbrus
Repository: spark Updated Branches: refs/heads/master 9df24624a -> fd95eeaf4 [SPARK-11954][SQL] Encoder for JavaBeans. Create a Java version of `constructorFor` and `extractorFor` in `JavaTypeInference`. Author: Wenchen Fan This patch had conflicts when merged,
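
The user-facing surface this enables, as a usage sketch; `Person` is a hypothetical JavaBean-style class, not part of the patch:

```scala
import org.apache.spark.sql.Encoders

// Hypothetical bean: no-arg constructor plus getter/setter pairs.
class Person extends Serializable {
  private var name: String = _
  def getName: String = name
  def setName(n: String): Unit = { name = n }
}

val personEncoder = Encoders.bean(classOf[Person]) // schema inferred from bean properties
```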

spark git commit: [SPARK-11954][SQL] Encoder for JavaBeans

2015-12-01 Thread marmbrus
Repository: spark Updated Branches: refs/heads/branch-1.6 74a230676 -> 88bbce008 [SPARK-11954][SQL] Encoder for JavaBeans. Create a Java version of `constructorFor` and `extractorFor` in `JavaTypeInference`. Author: Wenchen Fan This patch had conflicts when merged,

spark git commit: [SPARK-11905][SQL] Support Persist/Cache and Unpersist in Dataset APIs

2015-12-01 Thread marmbrus
Repository: spark Updated Branches: refs/heads/branch-1.6 88bbce008 -> 40769b48c [SPARK-11905][SQL] Support Persist/Cache and Unpersist in Dataset APIs. Persist and Unpersist exist in both the RDD and DataFrame APIs. I think they are still very critical in the Dataset APIs. Not sure if my

spark git commit: [SPARK-11905][SQL] Support Persist/Cache and Unpersist in Dataset APIs

2015-12-01 Thread marmbrus
Repository: spark Updated Branches: refs/heads/master fd95eeaf4 -> 0a7bca2da [SPARK-11905][SQL] Support Persist/Cache and Unpersist in Dataset APIs. Persist and Unpersist exist in both the RDD and DataFrame APIs. I think they are still very critical in the Dataset APIs. Not sure if my understanding
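
Usage mirrors the RDD and DataFrame APIs; a sketch, assuming `ds` is an existing `Dataset`:

```scala
import org.apache.spark.storage.StorageLevel

ds.persist(StorageLevel.MEMORY_AND_DISK) // or ds.cache() for the default level
// ... run several actions against the cached data ...
ds.unpersist()
```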

spark git commit: [SPARK-12046][DOC] Fixes various ScalaDoc/JavaDoc issues

2015-12-01 Thread yhuai
Repository: spark Updated Branches: refs/heads/branch-1.6 40769b48c -> 843a31afb [SPARK-12046][DOC] Fixes various ScalaDoc/JavaDoc issues. This PR backports PR #10039 to master. Author: Cheng Lian Closes #10063 from liancheng/spark-12046.doc-fix.master. (cherry picked

spark git commit: [SPARK-11821] Propagate Kerberos keytab for all environments

2015-12-01 Thread vanzin
Repository: spark Updated Branches: refs/heads/branch-1.6 843a31afb -> 99dc1335e [SPARK-11821] Propagate Kerberos keytab for all environments. andrewor14 the same PR as in branch 1.5 harishreedharan Author: woj-i Closes #9859 from woj-i/master. (cherry picked from

spark git commit: [SPARK-11821] Propagate Kerberos keytab for all environments

2015-12-01 Thread vanzin
Repository: spark Updated Branches: refs/heads/master 0a7bca2da -> 6a8cf80cc [SPARK-11821] Propagate Kerberos keytab for all environments. andrewor14 the same PR as in branch 1.5 harishreedharan Author: woj-i Closes #9859 from woj-i/master.

spark git commit: [SPARK-12065] Upgrade Tachyon from 0.8.1 to 0.8.2

2015-12-01 Thread joshrosen
Repository: spark Updated Branches: refs/heads/branch-1.6 99dc1335e -> ab2a124c8 [SPARK-12065] Upgrade Tachyon from 0.8.1 to 0.8.2. This commit upgrades the Tachyon dependency from 0.8.1 to 0.8.2. Author: Josh Rosen Closes #10054 from

spark git commit: [SPARK-12065] Upgrade Tachyon from 0.8.1 to 0.8.2

2015-12-01 Thread joshrosen
Repository: spark Updated Branches: refs/heads/master 6a8cf80cc -> 34e7093c1 [SPARK-12065] Upgrade Tachyon from 0.8.1 to 0.8.2. This commit upgrades the Tachyon dependency from 0.8.1 to 0.8.2. Author: Josh Rosen Closes #10054 from

spark git commit: Set SPARK_EC2_VERSION to 1.5.2

2015-12-01 Thread shivaram
Repository: spark Updated Branches: refs/heads/branch-1.5 d78f1bc45 -> 80dac0b07 Set SPARK_EC2_VERSION to 1.5.2 Author: Alexander Pivovarov Closes #10064 from apivovarov/patch-1.

spark git commit: [SPARK-11898][MLLIB] Use broadcast for the global tables in Word2Vec

2015-12-01 Thread srowen
Repository: spark Updated Branches: refs/heads/master 9693b0d5a -> a0af0e351 [SPARK-11898][MLLIB] Use broadcast for the global tables in Word2Vec. jira: https://issues.apache.org/jira/browse/SPARK-11898 syn0Global and syn1Global in Word2Vec are quite large objects with size (vocab *
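
A minimal sketch of the change, assuming the weight tables were previously captured in every task closure; sizes and names are illustrative:

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("w2v-sketch").setMaster("local[*]"))
val vocabSize  = 100000
val vectorSize = 100
val syn0Global = new Array[Float](vocabSize * vectorSize) // ~40 MB of shared weights
val bcSyn0     = sc.broadcast(syn0Global) // shipped once per executor, not per task
val sizes      = sc.parallelize(1 to 8, 8).map(_ => bcSyn0.value.length).collect()
```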

spark git commit: [SPARK-11949][SQL] Set field nullable property for GroupingSets to get correct results for null values

2015-12-01 Thread yhuai
Repository: spark Updated Branches: refs/heads/master a0af0e351 -> c87531b76 [SPARK-11949][SQL] Set field nullable property for GroupingSets to get correct results for null values. JIRA: https://issues.apache.org/jira/browse/SPARK-11949 The result of the cube plan uses an incorrect schema. The