[GitHub] spark issue #17083: [SPARK-19750][UI][branch-2.1] Fix redirect issue from ht...

2017-02-27 Thread jerryshao
Github user jerryshao commented on the issue:

https://github.com/apache/spark/pull/17083
  
Due to the change in https://github.com/apache/spark/pull/16625, the 
issue is obsolete in master, so it affects only Spark 2.1 and 2.0.





[GitHub] spark issue #17071: [SPARK-15615][SQL][BUILD][FOLLOW-UP] Replace deprecated ...

2017-02-27 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/17071
  
(I put a test here - 
https://github.com/apache/spark/pull/17071/files#diff-7e47859dbd409cc39f2908615fbd07ffR419)





[GitHub] spark pull request #17068: [SPARK-19709][SQL] Read empty file with CSV data ...

2017-02-27 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/17068#discussion_r103214603
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVInferSchema.scala
 ---
@@ -40,7 +41,19 @@ private[csv] object CSVInferSchema {
   csv: Dataset[String],
   caseSensitive: Boolean,
   options: CSVOptions): StructType = {
-val firstLine: String = CSVUtils.filterCommentAndEmpty(csv, 
options).first()
+val lines = CSVUtils.filterCommentAndEmpty(csv, options)
--- End diff --

Hi @wojtek-szymanski, I think we should not rely on exception handling. I 
can think of `take(1).headOption`, but we could use a shorter one if you know any 
other good way. What do you think about this?
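
As a rough illustration of the `take(1).headOption` suggestion, here is a minimal standalone sketch (not the PR's code; the object name is made up):

```scala
import org.apache.spark.sql.SparkSession

object HeadOptionSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[1]").appName("head-option").getOrCreate()
    import spark.implicits._

    val empty = spark.emptyDataset[String]

    // first() would throw NoSuchElementException on an empty Dataset;
    // take(1).headOption returns None instead, so the caller can decide what to do.
    val maybeFirst: Option[String] = empty.take(1).headOption
    println(maybeFirst) // None

    spark.stop()
  }
}
```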





[GitHub] spark pull request #16809: [SPARK-19463][SQL]refresh cache after the InsertI...

2017-02-27 Thread windpiger
Github user windpiger commented on a diff in the pull request:

https://github.com/apache/spark/pull/16809#discussion_r103185139
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InsertIntoHadoopFsRelationCommand.scala
 ---
@@ -132,6 +132,9 @@ case class InsertIntoHadoopFsRelationCommand(
 }
   }
 }
+
+sparkSession.catalog.refreshByPath(outputPath.toString)
--- End diff --

If we cache the table, `refreshByPath` will unpersist it.
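
A minimal runnable sketch of the `refreshByPath` behavior under discussion (the path and data here are hypothetical, and this is not the PR's test code):

```scala
import org.apache.spark.sql.SparkSession

object RefreshByPathSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[1]").appName("refresh-by-path").getOrCreate()
    import spark.implicits._

    val path = "/tmp/refresh_by_path_demo" // hypothetical output location
    Seq(1, 2, 3).toDF("id").write.mode("overwrite").parquet(path)

    val df = spark.read.parquet(path)
    df.cache()
    df.count() // materialize the cache

    // After more files are written under the same path, refreshByPath invalidates
    // the cached state built over that path, so subsequent reads see the new files.
    Seq(4, 5, 6).toDF("id").write.mode("append").parquet(path)
    spark.catalog.refreshByPath(path)
    println(spark.read.parquet(path).count()) // 6

    spark.stop()
  }
}
```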





[GitHub] spark pull request #16990: [SPARK-19660][CORE][SQL] Replace the configuratio...

2017-02-27 Thread steveloughran
Github user steveloughran commented on a diff in the pull request:

https://github.com/apache/spark/pull/16990#discussion_r103185158
  
--- Diff: 
sql/hive/src/test/resources/ql/src/test/queries/clientpositive/smb_mapjoin_25.q 
---
@@ -19,7 +19,7 @@ select * from (select a.key from smb_bucket_1 a join 
smb_bucket_2 b on (a.key =
 
 set hive.optimize.bucketmapjoin=true;
 set hive.optimize.bucketmapjoin.sortedmerge=true;
-set hive.mapred.reduce.tasks.speculative.execution=false;
+set hive.mapreduce.job.reduces.speculative.execution=false;
--- End diff --

thanks





[GitHub] spark pull request #14731: [SPARK-17159] [streaming]: optimise check for new...

2017-02-27 Thread uncleGen
Github user uncleGen commented on a diff in the pull request:

https://github.com/apache/spark/pull/14731#discussion_r103187577
  
--- Diff: 
streaming/src/main/scala/org/apache/spark/streaming/dstream/FileInputDStream.scala
 ---
@@ -140,7 +137,7 @@ class FileInputDStream[K, V, F <: NewInputFormat[K, V]](
* a union RDD out of them. Note that this maintains the list of files 
that were processed
* in the latest modification time in the previous call to this method. 
This is because the
* modification time returned by the FileStatus API seems to return 
times only at the
-   * granularity of seconds. And new files may have the same modification 
time as the
+   * granularity of seconds in HDFS. And new files may have the same 
modification time as the
* latest modification time in the previous call to this method yet was 
not reported in
--- End diff --

got it





[GitHub] spark issue #16971: [SPARK-19573][SQL] Make NaN/null handling consistent in ...

2017-02-27 Thread zhengruifeng
Github user zhengruifeng commented on the issue:

https://github.com/apache/spark/pull/16971
  
ping @MLnick @gatorsmile @thunterdb 





[GitHub] spark issue #17059: [SPARK-19733][ML]Removed unnecessary castings and refact...

2017-02-27 Thread datumbox
Github user datumbox commented on the issue:

https://github.com/apache/spark/pull/17059
  
@srowen: Thanks for the comments. We are getting there. :)

I will handle the Long case as you suggest. 

If you think people use SQL decimal types, I can include them at the end of 
the pattern matching. This will lead to some duplicate code though, because I need 
to write the same if statement twice. Any thoughts?





[GitHub] spark pull request #16990: [SPARK-19660][CORE][SQL] Replace the configuratio...

2017-02-27 Thread steveloughran
Github user steveloughran commented on a diff in the pull request:

https://github.com/apache/spark/pull/16990#discussion_r103183030
  
--- Diff: 
sql/hive/src/test/resources/ql/src/test/queries/clientpositive/smb_mapjoin_25.q 
---
@@ -19,7 +19,7 @@ select * from (select a.key from smb_bucket_1 a join 
smb_bucket_2 b on (a.key =
 
 set hive.optimize.bucketmapjoin=true;
 set hive.optimize.bucketmapjoin.sortedmerge=true;
-set hive.mapred.reduce.tasks.speculative.execution=false;
+set hive.mapreduce.job.reduces.speculative.execution=false;
--- End diff --

Looks like `hive.mapred.reduce.tasks.speculative.execution` in the [Hive 
wiki](https://cwiki.apache.org/confluence/display/Hive/Configuration+Properties).

But probably best to pull in a Hive developer, maybe @jcamachor. Jesus: 
could you look at these Hive config options and make sure they are the current 
set?





[GitHub] spark issue #17039: [SPARK-19710][SQL][TESTS] Fix ordering of rows in query ...

2017-02-27 Thread robbinspg
Github user robbinspg commented on the issue:

https://github.com/apache/spark/pull/17039
  
@gatorsmile  I'm glad it wasn't just me that found it complex ;-)

I've modified the patch to remove an unnecessary change as that query was 
not ordered and the test suite code handles that case.





[GitHub] spark issue #15505: [SPARK-18890][CORE] Move task serialization from the Tas...

2017-02-27 Thread witgo
Github user witgo commented on the issue:

https://github.com/apache/spark/pull/15505
  
Jenkins, retest this please.





[GitHub] spark pull request #16990: [SPARK-19660][CORE][SQL] Replace the configuratio...

2017-02-27 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/16990#discussion_r103180859
  
--- Diff: 
sql/hive/src/test/resources/ql/src/test/queries/clientpositive/smb_mapjoin_25.q 
---
@@ -19,7 +19,7 @@ select * from (select a.key from smb_bucket_1 a join 
smb_bucket_2 b on (a.key =
 
 set hive.optimize.bucketmapjoin=true;
 set hive.optimize.bucketmapjoin.sortedmerge=true;
-set hive.mapred.reduce.tasks.speculative.execution=false;
+set hive.mapreduce.job.reduces.speculative.execution=false;
--- End diff --

Is this supposed to be `mapreduce.reduce.speculative`? I'm looking at 
https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/DeprecatedProperties.html
Or maybe the hive.* version is different?





[GitHub] spark issue #17083: [SPARK-19750][UI][branch-2.1] Fix redirect issue from ht...

2017-02-27 Thread srowen
Github user srowen commented on the issue:

https://github.com/apache/spark/pull/17083
  
Was this fixed otherwise in master, or did some other change make it 
obsolete? Just trying to link this to whatever reason it's only a problem in 
2.1, for the record.





[GitHub] spark pull request #14731: [SPARK-17159] [streaming]: optimise check for new...

2017-02-27 Thread steveloughran
Github user steveloughran commented on a diff in the pull request:

https://github.com/apache/spark/pull/14731#discussion_r103183646
  
--- Diff: 
streaming/src/main/scala/org/apache/spark/streaming/dstream/FileInputDStream.scala
 ---
@@ -140,7 +137,7 @@ class FileInputDStream[K, V, F <: NewInputFormat[K, V]](
* a union RDD out of them. Note that this maintains the list of files 
that were processed
* in the latest modification time in the previous call to this method. 
This is because the
* modification time returned by the FileStatus API seems to return 
times only at the
-   * granularity of seconds. And new files may have the same modification 
time as the
+   * granularity of seconds in HDFS. And new files may have the same 
modification time as the
* latest modification time in the previous call to this method yet was 
not reported in
--- End diff --

No, it really is HDFS alone. We have no idea whatsoever about the 
granularity of other filesystems. It could be 2 seconds (is FAT32 supported? Hope 
not); NTFS is in nanoseconds, 
[apparently](https://msdn.microsoft.com/en-us/library/windows/desktop/ms724290(v=vs.85).aspx).
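
For reference, a small standalone sketch of reading a modification time through the Hadoop `FileStatus` API (the path comes from the command line; this is not the streaming code itself). On HDFS the returned value is typically only second-granular, while other filesystems make no such guarantee:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object ModTimeSketch {
  def main(args: Array[String]): Unit = {
    val path = new Path(args(0))
    val fs = path.getFileSystem(new Configuration())
    // getModificationTime returns milliseconds since the epoch, but the
    // underlying filesystem decides how precise that value actually is.
    println(s"modification time (ms): ${fs.getFileStatus(path).getModificationTime}")
  }
}
```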





[GitHub] spark issue #17080: [SPARK-19739][CORE] propagate S3 session token to cluser

2017-02-27 Thread steveloughran
Github user steveloughran commented on the issue:

https://github.com/apache/spark/pull/17080
  
LGTM. Verified the option name in the `org.apache.hadoop.fs.s3a.Constants` file 
and the env var name in `com.amazonaws.SDKGlobalConfiguration`.







[GitHub] spark issue #17039: [SPARK-19710][SQL][TESTS] Fix ordering of rows in query ...

2017-02-27 Thread robbinspg
Github user robbinspg commented on the issue:

https://github.com/apache/spark/pull/17039
  
Jenkins retest please





[GitHub] spark pull request #16990: [SPARK-19660][CORE][SQL] Replace the configuratio...

2017-02-27 Thread wangyum
Github user wangyum commented on a diff in the pull request:

https://github.com/apache/spark/pull/16990#discussion_r103200223
  
--- Diff: python/pyspark/tests.py ---
@@ -1515,12 +1515,12 @@ def test_oldhadoop(self):
 
 conf = {
 "mapred.output.format.class": 
"org.apache.hadoop.mapred.SequenceFileOutputFormat",
--- End diff --

@srowen Thanks, you are right. See:

https://github.com/apache/hadoop/blob/release-3.0.0-alpha2-RC0/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/Job.java#L1296

If `mapred.reducer.new-api=true` is set, the exception is:
```
Exception in thread "main" java.io.IOException: mapred.output.format.class 
is incompatible with new reduce API mode.
at org.apache.hadoop.mapreduce.Job.ensureNotSet(Job.java:1210)
at org.apache.hadoop.mapreduce.Job.setUseNewAPI(Job.java:1258)
at org.apache.hadoop.mapreduce.Job.submit(Job.java:1299)
at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1324)
at wpmcn.hadoop.WordCount.main(WordCount.java:63)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
```
I will update it soon.





[GitHub] spark pull request #16990: [SPARK-19660][CORE][SQL] Replace the configuratio...

2017-02-27 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/16990#discussion_r103179898
  
--- Diff: python/pyspark/tests.py ---
@@ -1515,12 +1515,12 @@ def test_oldhadoop(self):
 
 conf = {
 "mapred.output.format.class": 
"org.apache.hadoop.mapred.SequenceFileOutputFormat",
--- End diff --

I'm not sure what this key was supposed to be before; maybe 
`mapreduce.outputformat.class`? But it can be 
`mapreduce.job.outputformat.class` now?





[GitHub] spark pull request #14731: [SPARK-17159] [streaming]: optimise check for new...

2017-02-27 Thread steveloughran
Github user steveloughran commented on a diff in the pull request:

https://github.com/apache/spark/pull/14731#discussion_r103184528
  
--- Diff: docs/streaming-programming-guide.md ---
@@ -615,35 +615,114 @@ which creates a DStream from text
 data received over a TCP socket connection. Besides sockets, the 
StreamingContext API provides
 methods for creating DStreams from files as input sources.
 
-- **File Streams:** For reading data from files on any file system 
compatible with the HDFS API (that is, HDFS, S3, NFS, etc.), a DStream can be 
created as:
+ File Streams
+{:.no_toc}
+
+For reading data from files on any file system compatible with the HDFS 
API (that is, HDFS, S3, NFS, etc.), a DStream can be created as
+via `StreamingContext.fileStream[KeyClass, ValueClass, InputFormatClass]`.
+
+File streams do not require running a receiver, hence does not require 
allocating cores.
+
+For simple text files, the easiest method is 
`StreamingContext.textFileStream(dataDirectory)`. 
+
+
+
+
+{% highlight scala %}
+streamingContext.fileStream[KeyClass, ValueClass, 
InputFormatClass](dataDirectory)
+{% endhighlight %}
+For text files
+
+{% highlight scala %}
+streamingContext.textFileStream(dataDirectory)
+{% endhighlight %}
+
+
+
+{% highlight java %}
+streamingContext.fileStream(dataDirectory);
+{% endhighlight %}
+For text files
+
+{% highlight java %}
+streamingContext.textFileStream(dataDirectory);
+{% endhighlight %}
+
 
-
-
-streamingContext.fileStream[KeyClass, ValueClass, 
InputFormatClass](dataDirectory)
-
-
-   streamingContext.fileStream(dataDirectory);
-
-
-   streamingContext.textFileStream(dataDirectory)
-
-
+
+`fileStream` is not available in the Python API; only `textFileStream` is 
available.
+{% highlight python %}
+streamingContext.textFileStream(dataDirectory)
+{% endhighlight %}
+
 
-   Spark Streaming will monitor the directory `dataDirectory` and process 
any files created in that directory (files written in nested directories not 
supported). Note that
+
 
- + The files must have the same data format.
- + The files must be created in the `dataDirectory` by atomically 
*moving* or *renaming* them into
- the data directory.
- + Once moved, the files must not be changed. So if the files are 
being continuously appended, the new data will not be read.
+# How Directories are Monitored
+{:.no_toc}
 
-   For simple text files, there is an easier method 
`streamingContext.textFileStream(dataDirectory)`. And file streams do not 
require running a receiver, hence does not require allocating cores.
+Spark Streaming will monitor the directory `dataDirectory` and process any 
files created in that directory.
+
+   * A simple directory can be monitored, such as 
`"hdfs://namenode:8040/logs/"`.
+ All files directly under such a path will be processed as they are 
discovered.
+   + A [POSIX glob 
pattern](http://pubs.opengroup.org/onlinepubs/009695399/utilities/xcu_chap02.html#tag_02_13_02)
 can be supplied, such as
+ `"hdfs://namenode:8040/logs/2017/*"`.
+ Here, the DStream will consist of all files in the directories
+ matching the pattern.
+ That is: it is a pattern of directories, not of files in directories.
+   + All files must be in the same data format.
+   * A file is considered part of a time period based on its modification 
time,
+ not its creation time.
+   + Once processed, changes to a file within the current window will not 
cause the file to be reread.
+ That is: *updates are ignored*.
+   + The more files under a directory, the longer it will take to
+ scan for changes — even if no files have been modified.
+   * If a wildcard is used to identify directories, such as 
`"hdfs://namenode:8040/logs/2016-*"`,
+ renaming an entire directory to match the path will add the directory 
to the list of
+ monitored directories. Only the files in the directory whose 
modification time is
+ within the current window will be included in the stream.
+   + Calling 
[`FileSystem.setTimes()`](https://hadoop.apache.org/docs/current/api/org/apache/hadoop/fs/FileSystem.html#setTimes-org.apache.hadoop.fs.Path-long-long-)
+ to fix the timestamp is a way to have the file picked up in a later 
window, even if its contents have not changed.
+
+
--- End diff --

for "real" filesystems, rename doesn't change modtime, and files become 
visible in create(), so if you do a create() in the dest dir the file may be 
found and 

[GitHub] spark issue #16867: [WIP][SPARK-16929] Improve performance when check specul...

2017-02-27 Thread jinxing64
Github user jinxing64 commented on the issue:

https://github.com/apache/spark/pull/16867
  
@squito
Thanks a lot for your comments :)
> When checking speculatable tasks in TaskSetManager, the current code scans all 
task infos and sorts the durations of successful tasks in O(N log N) time complexity.

`checkSpeculatableTasks` is scheduled every 100ms by 
`scheduleAtFixedRate` (not `scheduleWithFixedDelay`), so the interval can be 
less than 100ms. In my cluster (yarn-cluster mode), if the size of the task set is 
over 30 and the driver is running on a machine with poor CPU 
performance, the `Arrays.sort` can easily take over 100ms. Since 
`checkSpeculatableTasks` synchronizes on `TaskSchedulerImpl`, I suspect that's 
why my driver hangs.

I get the median duration by `TreeSet.slice`, which comes from `IterableLike` 
and unfortunately cannot jump to the mid position, so the time complexity is O(n) in 
this PR.
I could get the mid position by reflection, but I don't want to do that; I 
think it would be harmful for code clarity.
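
A standalone sketch of the trade-off described above (simplified, with made-up durations; not the `TaskSetManager` code): sorting on every check is O(N log N), while slicing a sorted structure to its middle element is a single O(n) walk.

```scala
import scala.collection.immutable.TreeSet

object MedianSketch {
  // Sort the whole array on every speculation check: O(N log N) each time.
  def medianBySort(durations: Array[Long]): Long = {
    val sorted = durations.sorted
    sorted(sorted.length / 2)
  }

  // Keep durations in a TreeSet and slice out the middle element.
  // slice comes from IterableLike, so it walks from the head: O(n), no sort.
  def medianBySlice(durations: TreeSet[Long]): Long =
    durations.slice(durations.size / 2, durations.size / 2 + 1).head

  def main(args: Array[String]): Unit = {
    val durations = Array(120L, 80L, 95L, 200L, 150L)
    println(medianBySort(durations))               // 120
    println(medianBySlice(TreeSet(durations: _*))) // 120
  }
}
```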





[GitHub] spark issue #17071: [SPARK-15615][SQL][BUILD][FOLLOW-UP] Replace deprecated ...

2017-02-27 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/17071
  
Sure, that sounds better and I can't find a reason not to follow. Let me maybe 
add a single small Java one somewhere, because the deprecated Java one calls the 
deprecated Scala one.





[GitHub] spark issue #17059: [SPARK-19733][ML]Removed unnecessary castings and refact...

2017-02-27 Thread datumbox
Github user datumbox commented on the issue:

https://github.com/apache/spark/pull/17059
  
Ignore my comment about duplicate code. It can be written to avoid it. I 
will investigate handling the SQL decimal types as you recommended and I will 
update the code tonight.





[GitHub] spark pull request #16990: [SPARK-19660][CORE][SQL] Replace the configuratio...

2017-02-27 Thread jcamachor
Github user jcamachor commented on a diff in the pull request:

https://github.com/apache/spark/pull/16990#discussion_r103184073
  
--- Diff: 
sql/hive/src/test/resources/ql/src/test/queries/clientpositive/smb_mapjoin_25.q 
---
@@ -19,7 +19,7 @@ select * from (select a.key from smb_bucket_1 a join 
smb_bucket_2 b on (a.key =
 
 set hive.optimize.bucketmapjoin=true;
 set hive.optimize.bucketmapjoin.sortedmerge=true;
-set hive.mapred.reduce.tasks.speculative.execution=false;
+set hive.mapreduce.job.reduces.speculative.execution=false;
--- End diff --

@steveloughran, I checked the code and the property name in Hive is 
`hive.mapred.reduce.tasks.speculative.execution`.





[GitHub] spark pull request #17076: [SPARK-19745][ML] SVCAggregator captures coeffici...

2017-02-27 Thread MLnick
Github user MLnick commented on a diff in the pull request:

https://github.com/apache/spark/pull/17076#discussion_r103187723
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/classification/LinearSVC.scala ---
@@ -440,19 +440,9 @@ private class LinearSVCAggregator(
 
   private val numFeatures: Int = bcFeaturesStd.value.length
   private val numFeaturesPlusIntercept: Int = if (fitIntercept) 
numFeatures + 1 else numFeatures
-  private val coefficients: Vector = bcCoefficients.value
   private var weightSum: Double = 0.0
   private var lossSum: Double = 0.0
-  require(numFeaturesPlusIntercept == coefficients.size, s"Dimension 
mismatch. Coefficients " +
-s"length ${coefficients.size}, FeaturesStd length ${numFeatures}, 
fitIntercept: $fitIntercept")
-
-  private val coefficientsArray = coefficients match {
--- End diff --

We used to check it, I think? At some point there was a BLAS operation used 
that only worked for dense vectors. I think that was eliminated during the 
linear model refactoring for 2.0/2.1.
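
A hedged sketch of the dense-versus-sparse point (illustrative only; not the LinearSVC aggregator itself): code that needs a raw values array, e.g. for a dense-only BLAS call, has to densify sparse coefficient vectors first.

```scala
import org.apache.spark.ml.linalg.{DenseVector, SparseVector, Vector, Vectors}

object CoefficientsArraySketch {
  // Extract a contiguous values array, densifying if the coefficients are sparse.
  def valuesArray(v: Vector): Array[Double] = v match {
    case dv: DenseVector  => dv.values
    case sv: SparseVector => sv.toDense.values
  }

  def main(args: Array[String]): Unit = {
    println(valuesArray(Vectors.dense(1.0, 0.0, 2.0)).mkString(","))
    println(valuesArray(Vectors.sparse(3, Array(0, 2), Array(1.0, 2.0))).mkString(","))
  }
}
```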





[GitHub] spark issue #17083: [SPARK-19750][UI][branch-2.1] Fix redirect issue from ht...

2017-02-27 Thread jerryshao
Github user jerryshao commented on the issue:

https://github.com/apache/spark/pull/17083
  
Not sure why Jenkins test cannot be started automatically.





[GitHub] spark issue #17090: [Spark-19535][ML] RecommendForAllUsers RecommendForAllIt...

2017-02-27 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17090
  
**[Test build #73543 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73543/testReport)**
 for PR 17090 at commit 
[`832b066`](https://github.com/apache/spark/commit/832b066f490c212b5a79fd045460525afd9576b9).





[GitHub] spark issue #16959: [SPARK-19631][CORE] OutputCommitCoordinator should not a...

2017-02-27 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16959
  
**[Test build #73544 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73544/testReport)**
 for PR 16959 at commit 
[`20f028a`](https://github.com/apache/spark/commit/20f028ad5e6f746842ca3dd10ea12811a4a699a4).





[GitHub] spark issue #17052: [SPARK-19690][SS] Join a streaming DataFrame with a batc...

2017-02-27 Thread uncleGen
Github user uncleGen commented on the issue:

https://github.com/apache/spark/pull/17052
  
working on unit test failure





[GitHub] spark issue #17012: [SPARK-19677][SS] Renaming a file atop an existing one s...

2017-02-27 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17012
  
**[Test build #73548 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73548/testReport)**
 for PR 17012 at commit 
[`530c027`](https://github.com/apache/spark/commit/530c027e8ac22caa6fac3770ae24c6727ab7c018).
 * This patch **fails Scala style tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #17012: [SPARK-19677][SS] Renaming a file atop an existing one s...

2017-02-27 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17012
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/73548/
Test FAILed.





[GitHub] spark issue #17012: [SPARK-19677][SS] Renaming a file atop an existing one s...

2017-02-27 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17012
  
Merged build finished. Test FAILed.





[GitHub] spark issue #17082: [SPARK-19749][SS] Name socket source with a meaningful n...

2017-02-27 Thread zsxwing
Github user zsxwing commented on the issue:

https://github.com/apache/spark/pull/17082
  
Thanks! LGTM. Merging to master.





[GitHub] spark issue #16774: [SPARK-19357][ML] Adding parallel model evaluation in ML...

2017-02-27 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16774
  
Merged build finished. Test PASSed.





[GitHub] spark issue #16774: [SPARK-19357][ML] Adding parallel model evaluation in ML...

2017-02-27 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16774
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/73545/
Test PASSed.





[GitHub] spark pull request #17093: [SPARK-19761][SQL]create InMemoryFileIndex with a...

2017-02-27 Thread windpiger
GitHub user windpiger opened a pull request:

https://github.com/apache/spark/pull/17093

[SPARK-19761][SQL]create InMemoryFileIndex with an empty rootPaths when set 
PARALLEL_PARTITION_DISCOVERY_THRESHOLD to zero failed

## What changes were proposed in this pull request?

If we create an InMemoryFileIndex with empty rootPaths when 
PARALLEL_PARTITION_DISCOVERY_THRESHOLD is set to zero, it throws an exception:

```
Positive number of slices required
java.lang.IllegalArgumentException: Positive number of slices required
at 
org.apache.spark.rdd.ParallelCollectionRDD$.slice(ParallelCollectionRDD.scala:119)
at 
org.apache.spark.rdd.ParallelCollectionRDD.getPartitions(ParallelCollectionRDD.scala:97)
at 
org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)
at 
org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:250)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:250)
at 
org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
at 
org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)
at 
org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:250)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:250)
at 
org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
at 
org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)
at 
org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:250)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:250)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2084)
at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:936)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
at org.apache.spark.rdd.RDD.collect(RDD.scala:935)
at 
org.apache.spark.sql.execution.datasources.PartitioningAwareFileIndex$.org$apache$spark$sql$execution$datasources$PartitioningAwareFileIndex$$bulkListLeafFiles(PartitioningAwareFileIndex.scala:357)
at 
org.apache.spark.sql.execution.datasources.PartitioningAwareFileIndex.listLeafFiles(PartitioningAwareFileIndex.scala:256)
at 
org.apache.spark.sql.execution.datasources.InMemoryFileIndex.refresh0(InMemoryFileIndex.scala:74)
at 
org.apache.spark.sql.execution.datasources.InMemoryFileIndex.<init>(InMemoryFileIndex.scala:50)
at 
org.apache.spark.sql.execution.datasources.FileIndexSuite$$anonfun$9$$anonfun$apply$mcV$sp$2.apply$mcV$sp(FileIndexSuite.scala:186)
at 
org.apache.spark.sql.test.SQLTestUtils$class.withSQLConf(SQLTestUtils.scala:105)
at 
org.apache.spark.sql.execution.datasources.FileIndexSuite.withSQLConf(FileIndexSuite.scala:33)
at 
org.apache.spark.sql.execution.datasources.FileIndexSuite$$anonfun$9.apply$mcV$sp(FileIndexSuite.scala:185)
at 
org.apache.spark.sql.execution.datasources.FileIndexSuite$$anonfun$9.apply(FileIndexSuite.scala:185)
at 
org.apache.spark.sql.execution.datasources.FileIndexSuite$$anonfun$9.apply(FileIndexSuite.scala:185)
at 
org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)
at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
```
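
A hedged, standalone reproduction of the underlying failure (assuming the parallel listing sizes its job by the number of paths, so an empty rootPaths effectively asks for zero slices; this is not the FileIndex code itself):

```scala
import org.apache.spark.sql.SparkSession

object ZeroSlicesSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[1]").appName("zero-slices").getOrCreate()
    val sc = spark.sparkContext
    // ParallelCollectionRDD requires at least one slice, so parallelizing with
    // numSlices = 0 throws "Positive number of slices required".
    sc.parallelize(Seq.empty[String], numSlices = 0).collect()
    spark.stop()
  }
}
```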

## How was this patch tested?
unit test added

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/windpiger/spark fixEmptiPathInBulkListFiles

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/17093.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #17093


commit 96898a2332c64b101efc54d1ccbbf29102b88e68
Author: windpiger 
Date:   2017-02-28T02:21:29Z

[SPARK-19761][SQL]create InMemoryFileIndex with an empty rootPaths when set 
PARALLEL_PARTITION_DISCOVERY_THRESHOLD to zero failed





[GitHub] spark issue #16989: [WIP][SPARK-19659] Fetch big blocks to disk when shuffle...

2017-02-27 Thread jinxing64
Github user jinxing64 commented on the issue:

https://github.com/apache/spark/pull/16989
  
@squito 
I've uploaded a design doc to JIRA; please take a look when you have time :)





[GitHub] spark issue #17092: [SPARK-18450][ML] Scala API Change for LSH AND-amplifica...

2017-02-27 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17092
  
Merged build finished. Test PASSed.





[GitHub] spark pull request #17047: [SPARK-19720][SPARK SUBMIT] Redact sensitive info...

2017-02-27 Thread markgrover
Github user markgrover commented on a diff in the pull request:

https://github.com/apache/spark/pull/17047#discussion_r103367049
  
--- Diff: core/src/main/scala/org/apache/spark/util/Utils.scala ---
@@ -2574,13 +2575,30 @@ private[spark] object Utils extends Logging {
 
   def redact(conf: SparkConf, kvs: Seq[(String, String)]): Seq[(String, 
String)] = {
 val redactionPattern = conf.get(SECRET_REDACTION_PATTERN).r
+redact(redactionPattern, kvs)
+  }
+
+  private def redact(redactionPattern: Regex, kvs: Seq[(String, String)]): 
Seq[(String, String)] = {
 kvs.map { kv =>
   redactionPattern.findFirstIn(kv._1)
 .map { ignore => (kv._1, REDACTION_REPLACEMENT_TEXT) }
 .getOrElse(kv)
 }
   }
 
+  /**
+   * Looks up the redaction regex from within the key value pairs and uses 
it to redact the rest
+   * of the key value pairs. No care is taken to make sure the redaction 
property itself is not
+   * redacted. So theoretically, the property itself could be configured 
to redact its own value
+   * when printing.
+   * @param kvs
+   * @return
+   */
+  def redact(kvs: Map[String, String]): Seq[(String, String)] = {
--- End diff --

Correct, that's exactly the use case - where there isn't a conf object 
available yet. I will update the Javadoc. Thanks for reviewing!
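
A standalone sketch of the redaction pattern shown in the diff above (simplified names, not Spark's `Utils` API): any key matching the redaction regex has its value replaced before the pairs are printed.

```scala
import scala.util.matching.Regex

object RedactSketch {
  private val ReplacementText = "*********(redacted)"

  // Replace the value of every key that matches the redaction pattern.
  def redact(redactionPattern: Regex, kvs: Seq[(String, String)]): Seq[(String, String)] =
    kvs.map { case (key, value) =>
      redactionPattern.findFirstIn(key)
        .map(_ => (key, ReplacementText))
        .getOrElse((key, value))
    }

  def main(args: Array[String]): Unit = {
    val pattern = "(?i)secret|password|token".r
    val conf = Seq("spark.app.name" -> "demo", "spark.my.password" -> "hunter2")
    redact(pattern, conf).foreach(println)
    // (spark.app.name,demo)
    // (spark.my.password,*********(redacted))
  }
}
```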





[GitHub] spark issue #17092: [SPARK-18450][ML] Scala API Change for LSH AND-amplifica...

2017-02-27 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17092
  
**[Test build #73550 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73550/testReport)**
 for PR 17092 at commit 
[`9dd87ba`](https://github.com/apache/spark/commit/9dd87ba21a025939df7020ff1491a2c6c29f2d93).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #17092: [SPARK-18450][ML] Scala API Change for LSH AND-amplifica...

2017-02-27 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17092
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/73550/
Test PASSed.





[GitHub] spark issue #17079: [SPARK-19748][SQL]refresh function has a wrong order to ...

2017-02-27 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17079
  
**[Test build #73546 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73546/testReport)**
 for PR 17079 at commit 
[`1ec20a5`](https://github.com/apache/spark/commit/1ec20a5146e7e7c79386d0c0e9fdffad3254cdc6).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #17094: [SPARK-19762][ML][WIP] Hierarchy for consolidating ML ag...

2017-02-27 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17094
  
**[Test build #73557 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73557/testReport)**
 for PR 17094 at commit 
[`9a04d0b`](https://github.com/apache/spark/commit/9a04d0bc51bed29bca28a5e34ebc5b614b6560d2).





[GitHub] spark issue #17052: [SPARK-19690][SS] Join a streaming DataFrame with a batc...

2017-02-27 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17052
  
**[Test build #73558 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73558/testReport)**
 for PR 17052 at commit 
[`c87651a`](https://github.com/apache/spark/commit/c87651abaf4af9eea1b292495fb0708dd0265274).





[GitHub] spark issue #17095: [SPARK-19763][SQL]qualified external datasource table lo...

2017-02-27 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17095
  
**[Test build #73556 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73556/testReport)**
 for PR 17095 at commit 
[`570ce24`](https://github.com/apache/spark/commit/570ce24bee80dad5b2e897db34d04f3752139555).





[GitHub] spark pull request #17079: [SPARK-19748][SQL]refresh function has a wrong or...

2017-02-27 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/17079#discussion_r103373646
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/FileIndexSuite.scala
 ---
@@ -178,6 +178,33 @@ class FileIndexSuite extends SharedSQLContext {
   assert(catalog2.allFiles().nonEmpty)
 }
   }
+
+  test("refresh for InMemoryFileIndex with FileStatusCache") {
+withTempDir { dir =>
+  val fileStatusCache = FileStatusCache.getOrCreate(spark)
+  val dirPath = new Path(dir.getAbsolutePath)
+  val fs = dirPath.getFileSystem(spark.sessionState.newHadoopConf())
+  val catalog =
+new InMemoryFileIndex(spark, Seq(dirPath), Map.empty, None, 
fileStatusCache) {
+def leafFilePaths: Seq[Path] = leafFiles.keys.toSeq
+def leafDirPaths: Seq[Path] = leafDirToChildrenFiles.keys.toSeq
+  }
--- End diff --

Nit: indentation issues in the above three lines.





[GitHub] spark issue #17079: [SPARK-19748][SQL]refresh function has a wrong order to ...

2017-02-27 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/17079
  
LGTM except two minor comments.





[GitHub] spark pull request #17079: [SPARK-19748][SQL]refresh function has a wrong or...

2017-02-27 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/17079#discussion_r103373620
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/FileIndexSuite.scala
 ---
@@ -178,6 +178,33 @@ class FileIndexSuite extends SharedSQLContext {
   assert(catalog2.allFiles().nonEmpty)
 }
   }
+
+  test("refresh for InMemoryFileIndex with FileStatusCache") {
+withTempDir { dir =>
+  val fileStatusCache = FileStatusCache.getOrCreate(spark)
+  val dirPath = new Path(dir.getAbsolutePath)
+  val fs = dirPath.getFileSystem(spark.sessionState.newHadoopConf())
+  val catalog =
+new InMemoryFileIndex(spark, Seq(dirPath), Map.empty, None, 
fileStatusCache) {
+def leafFilePaths: Seq[Path] = leafFiles.keys.toSeq
+def leafDirPaths: Seq[Path] = leafDirToChildrenFiles.keys.toSeq
+  }
+
+  assert(catalog.leafDirPaths.isEmpty)
+  assert(catalog.leafFilePaths.isEmpty)
--- End diff --

Move these two asserts after `stringToFile`





[GitHub] spark issue #17059: [SPARK-19733][ML]Removed unnecessary castings and refact...

2017-02-27 Thread imatiach-msft
Github user imatiach-msft commented on the issue:

https://github.com/apache/spark/pull/17059
  
@datumbox I like the changes; I just had a minor concern about the code 
where we call `v.intValue` and then compare it to `v.doubleValue`. Due to 
precision issues, I'm not sure this is desirable, since the data could come 
from any source and be slightly modified or fall outside the Int range, and the 
previous code did not do this check.
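
A minimal sketch of the precision concern (the check shown is an assumption about the shape of the code, not the PR itself): a value that is almost-but-not-exactly integral, or that lies outside the Int range, fails an `intValue`/`doubleValue` round-trip comparison.

```scala
object IntDoubleCheckSketch {
  // Hypothetical check: does this Number carry an exactly representable Int?
  def looksLikeInt(v: Number): Boolean = v.intValue.toDouble == v.doubleValue

  def main(args: Array[String]): Unit = {
    println(looksLikeInt(Double.box(3.0)))          // true
    println(looksLikeInt(Double.box(3.0000000001))) // false: tiny precision drift
    println(looksLikeInt(Double.box(3.0e10)))       // false: outside the Int range
  }
}
```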





[GitHub] spark issue #17015: [SPARK-19678][SQL] remove MetastoreRelation

2017-02-27 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/17015
  
retest this please





[GitHub] spark issue #10307: [SPARK-12334][SQL][PYSPARK] Support read from multiple i...

2017-02-27 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/10307
  
**[Test build #73567 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73567/testReport)**
 for PR 10307 at commit 
[`401f682`](https://github.com/apache/spark/commit/401f6829dcadb6d0f2ce51c99520cc55dbc28995).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17015: [SPARK-19678][SQL] remove MetastoreRelation

2017-02-27 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/17015#discussion_r103387597
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/command/AnalyzeColumnCommand.scala
 ---
@@ -40,38 +38,24 @@ case class AnalyzeColumnCommand(
 val sessionState = sparkSession.sessionState
 val db = 
tableIdent.database.getOrElse(sessionState.catalog.getCurrentDatabase)
 val tableIdentWithDB = TableIdentifier(tableIdent.table, Some(db))
-val relation =
-  
EliminateSubqueryAliases(sparkSession.table(tableIdentWithDB).queryExecution.analyzed)
-
-// Compute total size
-val (catalogTable: CatalogTable, sizeInBytes: Long) = relation match {
-  case catalogRel: CatalogRelation =>
-// This is a Hive serde format table
-(catalogRel.catalogTable,
-  AnalyzeTableCommand.calculateTotalSize(sessionState, 
catalogRel.catalogTable))
-
-  case logicalRel: LogicalRelation if 
logicalRel.catalogTable.isDefined =>
-// This is a data source format table
-(logicalRel.catalogTable.get,
-  AnalyzeTableCommand.calculateTotalSize(sessionState, 
logicalRel.catalogTable.get))
-
-  case otherRelation =>
-throw new AnalysisException("ANALYZE TABLE is not supported for " +
-  s"${otherRelation.nodeName}.")
+val tableMeta = sessionState.catalog.getTableMetadata(tableIdentWithDB)
+if (tableMeta.tableType == CatalogTableType.VIEW) {
+  throw new AnalysisException("ANALYZE TABLE is not supported on 
views.")
 }
+val sizeInBytes = AnalyzeTableCommand.calculateTotalSize(sessionState, 
tableMeta)
 
 // Compute stats for each column
 val (rowCount, newColStats) =
-  AnalyzeColumnCommand.computeColumnStats(sparkSession, 
tableIdent.table, relation, columnNames)
+  AnalyzeColumnCommand.computeColumnStats(sparkSession, 
tableIdentWithDB, columnNames)
--- End diff --

`object AnalyzeColumnCommand` is not needed. We can move 
`computeColumnStats` into the `case class AnalyzeColumnCommand`.
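
As a generic illustration of that kind of move (the names below are made up for the sketch, not Spark's):

```
// Before: the helper lives in the companion object, even though only the command uses it.
case class MyCommandBefore(table: String, columns: Seq[String]) {
  def run(): Long = MyCommandBefore.computeStats(columns)
}
object MyCommandBefore {
  def computeStats(cols: Seq[String]): Long = cols.length.toLong
}

// After: the helper becomes a private method on the case class and the object can be dropped.
case class MyCommandAfter(table: String, columns: Seq[String]) {
  def run(): Long = computeStats(columns)
  private def computeStats(cols: Seq[String]): Long = cols.length.toLong
}
```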


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17079: [SPARK-19748][SQL]refresh function has a wrong or...

2017-02-27 Thread windpiger
Github user windpiger commented on a diff in the pull request:

https://github.com/apache/spark/pull/17079#discussion_r103357633
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/FileIndexSuite.scala
 ---
@@ -178,6 +178,34 @@ class FileIndexSuite extends SharedSQLContext {
   assert(catalog2.allFiles().nonEmpty)
 }
   }
+
+  test("refresh for InMemoryFileIndex with FileStatusCache") {
+withTempDir { dir =>
+  val fileStatusCache = FileStatusCache.getOrCreate(spark)
+  val dirPath = new Path(dir.getAbsolutePath)
+  val catalog = new InMemoryFileIndex(spark, Seq(dirPath), Map.empty,
+None, fileStatusCache) {
+def leafFilePaths: Seq[Path] = leafFiles.keys.toSeq
+def leafDirPaths: Seq[Path] = leafDirToChildrenFiles.keys.toSeq
+  }
+
+  assert(catalog.leafDirPaths.isEmpty)
+  assert(catalog.leafFilePaths.isEmpty)
+
+  val file = new File(dir, "text.txt")
+  stringToFile(file, "text")
+
+  catalog.refresh()
+
+  assert(catalog.leafFilePaths.size == 1)
+  assert(catalog.leafFilePaths.head.toString.stripSuffix("/") ==
+s"file:${file.getAbsolutePath.stripSuffix("/")}")
--- End diff --

OK, let me modify it. Thanks!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17062: [SPARK-17495] [SQL] Support date, timestamp and i...

2017-02-27 Thread tejasapatil
Github user tejasapatil commented on a diff in the pull request:

https://github.com/apache/spark/pull/17062#discussion_r103357272
  
--- Diff: 
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/HashExpressionsSuite.scala
 ---
@@ -169,6 +171,96 @@ class HashExpressionsSuite extends SparkFunSuite with 
ExpressionEvalHelper {
 // scalastyle:on nonascii
   }
 
+  test("hive-hash for date type") {
+def checkHiveHashForDateType(dateString: String, expected: Long): Unit 
= {
+  checkHiveHash(
+DateTimeUtils.stringToDate(UTF8String.fromString(dateString)).get,
+DateType,
+expected)
+}
+
+// basic case
+checkHiveHashForDateType("2017-01-01", 17167)
+
+// boundary cases
+checkHiveHashForDateType("-01-01", -719530)
+checkHiveHashForDateType("-12-31", 2932896)
+
+// epoch
+checkHiveHashForDateType("1970-01-01", 0)
+
+// before epoch
+checkHiveHashForDateType("1800-01-01", -62091)
+
+// Invalid input: bad date string. Hive returns 0 for such cases
+intercept[NoSuchElementException](checkHiveHashForDateType("0-0-0", 0))
+
intercept[NoSuchElementException](checkHiveHashForDateType("-1212-01-01", 0))
+
intercept[NoSuchElementException](checkHiveHashForDateType("2016-99-99", 0))
+
+// Invalid input: Empty string. Hive returns 0 for this case
+intercept[NoSuchElementException](checkHiveHashForDateType("", 0))
+
+// Invalid input: February 30th for a leap year. Hive supports this 
but Spark doesn't
+
intercept[NoSuchElementException](checkHiveHashForDateType("2016-02-30", 16861))
+  }
+
+  test("hive-hash for timestamp type") {
+def checkHiveHashForTimestampType(
+timestamp: String,
+expected: Long,
+timeZone: TimeZone = TimeZone.getTimeZone("UTC")): Unit = {
+  checkHiveHash(
+DateTimeUtils.stringToTimestamp(UTF8String.fromString(timestamp), 
timeZone).get,
+TimestampType,
+expected)
+}
+
+// basic case
+checkHiveHashForTimestampType("2017-02-24 10:56:29", 1445725271)
+
+// with higher precision
+checkHiveHashForTimestampType("2017-02-24 10:56:29.11", 1353936655)
+
+// with different timezone
+checkHiveHashForTimestampType("2017-02-24 10:56:29", 1445732471,
+  TimeZone.getTimeZone("US/Pacific"))
+
+// boundary cases
+checkHiveHashForTimestampType("0001-01-01 00:00:00", 1645926784)
+checkHiveHashForTimestampType("-01-01 00:00:00", -1081818240)
+
+// epoch
+checkHiveHashForTimestampType("1970-01-01 00:00:00", 0)
+
+// before epoch
+checkHiveHashForTimestampType("1800-01-01 03:12:45", -267420885)
+
+// Invalid input: bad timestamp string. Hive returns 0 for such cases
+intercept[NoSuchElementException](checkHiveHashForTimestampType("0-0-0 
0:0:0", 0))
+
intercept[NoSuchElementException](checkHiveHashForTimestampType("-99-99-99 
99:99:45", 0))
+
intercept[NoSuchElementException](checkHiveHashForTimestampType("55-5-",
 0))
+
+// Invalid input: Empty string. Hive returns 0 for this case
+intercept[NoSuchElementException](checkHiveHashForTimestampType("", 0))
+
+// Invalid input: February 30th for a leap year. Hive supports this 
but Spark doesn't
+
intercept[NoSuchElementException](checkHiveHashForTimestampType("2016-02-30 
00:00:00", 0))
+
+// Invalid input: Hive accepts upto 9 decimal place precision but 
Spark uses upto 6
+
intercept[TestFailedException](checkHiveHashForTimestampType("2017-02-24 
10:56:29.", 0))
+  }
+
+  test("hive-hash for CalendarInterval type") {
+def checkHiveHashForTimestampType(interval: String, expected: Long): 
Unit = {
+  checkHiveHash(CalendarInterval.fromString(interval), 
CalendarIntervalType, expected)
+}
+
+checkHiveHashForTimestampType("interval 1 day", 3220073)
--- End diff --

Corresponding Hive query:
SELECT HASH ( INTERVAL '1' DAY );


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17062: [SPARK-17495] [SQL] Support date, timestamp and i...

2017-02-27 Thread tejasapatil
Github user tejasapatil commented on a diff in the pull request:

https://github.com/apache/spark/pull/17062#discussion_r103357588
  
--- Diff: 
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/HashExpressionsSuite.scala
 ---
@@ -169,6 +171,96 @@ class HashExpressionsSuite extends SparkFunSuite with 
ExpressionEvalHelper {
 // scalastyle:on nonascii
   }
 
+  test("hive-hash for date type") {
+def checkHiveHashForDateType(dateString: String, expected: Long): Unit 
= {
+  checkHiveHash(
+DateTimeUtils.stringToDate(UTF8String.fromString(dateString)).get,
+DateType,
+expected)
+}
+
+// basic case
+checkHiveHashForDateType("2017-01-01", 17167)
+
+// boundary cases
+checkHiveHashForDateType("-01-01", -719530)
+checkHiveHashForDateType("-12-31", 2932896)
+
+// epoch
+checkHiveHashForDateType("1970-01-01", 0)
+
+// before epoch
+checkHiveHashForDateType("1800-01-01", -62091)
+
+// Invalid input: bad date string. Hive returns 0 for such cases
+intercept[NoSuchElementException](checkHiveHashForDateType("0-0-0", 0))
+
intercept[NoSuchElementException](checkHiveHashForDateType("-1212-01-01", 0))
+
intercept[NoSuchElementException](checkHiveHashForDateType("2016-99-99", 0))
+
+// Invalid input: Empty string. Hive returns 0 for this case
+intercept[NoSuchElementException](checkHiveHashForDateType("", 0))
+
+// Invalid input: February 30th for a leap year. Hive supports this 
but Spark doesn't
+
intercept[NoSuchElementException](checkHiveHashForDateType("2016-02-30", 16861))
+  }
+
+  test("hive-hash for timestamp type") {
+def checkHiveHashForTimestampType(
+timestamp: String,
+expected: Long,
+timeZone: TimeZone = TimeZone.getTimeZone("UTC")): Unit = {
+  checkHiveHash(
+DateTimeUtils.stringToTimestamp(UTF8String.fromString(timestamp), 
timeZone).get,
+TimestampType,
+expected)
+}
+
+// basic case
+checkHiveHashForTimestampType("2017-02-24 10:56:29", 1445725271)
+
+// with higher precision
+checkHiveHashForTimestampType("2017-02-24 10:56:29.11", 1353936655)
+
+// with different timezone
+checkHiveHashForTimestampType("2017-02-24 10:56:29", 1445732471,
+  TimeZone.getTimeZone("US/Pacific"))
+
+// boundary cases
+checkHiveHashForTimestampType("0001-01-01 00:00:00", 1645926784)
+checkHiveHashForTimestampType("-01-01 00:00:00", -1081818240)
+
+// epoch
+checkHiveHashForTimestampType("1970-01-01 00:00:00", 0)
+
+// before epoch
+checkHiveHashForTimestampType("1800-01-01 03:12:45", -267420885)
+
+// Invalid input: bad timestamp string. Hive returns 0 for such cases
+intercept[NoSuchElementException](checkHiveHashForTimestampType("0-0-0 
0:0:0", 0))
+
intercept[NoSuchElementException](checkHiveHashForTimestampType("-99-99-99 
99:99:45", 0))
+
intercept[NoSuchElementException](checkHiveHashForTimestampType("55-5-",
 0))
+
+// Invalid input: Empty string. Hive returns 0 for this case
+intercept[NoSuchElementException](checkHiveHashForTimestampType("", 0))
+
+// Invalid input: February 30th for a leap year. Hive supports this 
but Spark doesn't
+
intercept[NoSuchElementException](checkHiveHashForTimestampType("2016-02-30 
00:00:00", 0))
+
+// Invalid input: Hive accepts upto 9 decimal place precision but 
Spark uses upto 6
+
intercept[TestFailedException](checkHiveHashForTimestampType("2017-02-24 
10:56:29.", 0))
+  }
+
+  test("hive-hash for CalendarInterval type") {
+def checkHiveHashForTimestampType(interval: String, expected: Long): 
Unit = {
+  checkHiveHash(CalendarInterval.fromString(interval), 
CalendarIntervalType, expected)
+}
+
+checkHiveHashForTimestampType("interval 1 day", 3220073)
+checkHiveHashForTimestampType("interval 6 day 15 hour", 21202073)
+checkHiveHashForTimestampType("interval -23 day 56 hour -113 
minute 9898989 second",
--- End diff --

Corresponding Hive query:
SELECT HASH ( INTERVAL '-23' DAY + INTERVAL '56' HOUR + INTERVAL '-113' MINUTE + INTERVAL '9898989' SECOND );


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: 

[GitHub] spark pull request #17062: [SPARK-17495] [SQL] Support date, timestamp and i...

2017-02-27 Thread tejasapatil
Github user tejasapatil commented on a diff in the pull request:

https://github.com/apache/spark/pull/17062#discussion_r103300592
  
--- Diff: 
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/HashExpressionsSuite.scala
 ---
@@ -169,6 +171,96 @@ class HashExpressionsSuite extends SparkFunSuite with 
ExpressionEvalHelper {
 // scalastyle:on nonascii
   }
 
+  test("hive-hash for date type") {
+def checkHiveHashForDateType(dateString: String, expected: Long): Unit 
= {
+  checkHiveHash(
+DateTimeUtils.stringToDate(UTF8String.fromString(dateString)).get,
+DateType,
+expected)
+}
+
+// basic case
+checkHiveHashForDateType("2017-01-01", 17167)
+
+// boundary cases
+checkHiveHashForDateType("-01-01", -719530)
+checkHiveHashForDateType("-12-31", 2932896)
+
+// epoch
+checkHiveHashForDateType("1970-01-01", 0)
+
+// before epoch
+checkHiveHashForDateType("1800-01-01", -62091)
+
+// Invalid input: bad date string. Hive returns 0 for such cases
+intercept[NoSuchElementException](checkHiveHashForDateType("0-0-0", 0))
+
intercept[NoSuchElementException](checkHiveHashForDateType("-1212-01-01", 0))
+
intercept[NoSuchElementException](checkHiveHashForDateType("2016-99-99", 0))
+
+// Invalid input: Empty string. Hive returns 0 for this case
+intercept[NoSuchElementException](checkHiveHashForDateType("", 0))
+
+// Invalid input: February 30th for a leap year. Hive supports this 
but Spark doesn't
+
intercept[NoSuchElementException](checkHiveHashForDateType("2016-02-30", 16861))
+  }
+
+  test("hive-hash for timestamp type") {
+def checkHiveHashForTimestampType(
+timestamp: String,
+expected: Long,
+timeZone: TimeZone = TimeZone.getTimeZone("UTC")): Unit = {
+  checkHiveHash(
+DateTimeUtils.stringToTimestamp(UTF8String.fromString(timestamp), 
timeZone).get,
+TimestampType,
+expected)
+}
+
+// basic case
+checkHiveHashForTimestampType("2017-02-24 10:56:29", 1445725271)
+
+// with higher precision
+checkHiveHashForTimestampType("2017-02-24 10:56:29.11", 1353936655)
+
+// with different timezone
+checkHiveHashForTimestampType("2017-02-24 10:56:29", 1445732471,
+  TimeZone.getTimeZone("US/Pacific"))
+
+// boundary cases
+checkHiveHashForTimestampType("0001-01-01 00:00:00", 1645926784)
+checkHiveHashForTimestampType("-01-01 00:00:00", -1081818240)
+
+// epoch
+checkHiveHashForTimestampType("1970-01-01 00:00:00", 0)
+
+// before epoch
+checkHiveHashForTimestampType("1800-01-01 03:12:45", -267420885)
+
+// Invalid input: bad timestamp string. Hive returns 0 for such cases
--- End diff --

Same as `Date`: invalid timestamp values are not allowed in Spark and it 
will fail. Hive will not fail but falls back to `null` and returns `0` as the 
hash value.
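
For reference, a minimal sketch of where the intercepted exception comes from (this uses the same DateTimeUtils API as the quoted test):

```
import java.util.TimeZone
import org.apache.spark.sql.catalyst.util.DateTimeUtils
import org.apache.spark.unsafe.types.UTF8String

val tz = TimeZone.getTimeZone("UTC")
// Spark refuses to parse the malformed timestamp and returns None ...
val parsed = DateTimeUtils.stringToTimestamp(UTF8String.fromString("0-0-0 0:0:0"), tz)
println(parsed)  // None
// ... so the .get in the test helper throws NoSuchElementException, which the test intercepts.
```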


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16715: [Spark-18080][ML][PYTHON] Python API & Examples f...

2017-02-27 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/16715#discussion_r103357342
  
--- Diff: python/pyspark/ml/feature.py ---
@@ -120,6 +122,196 @@ def getThreshold(self):
 return self.getOrDefault(self.threshold)
 
 
+class LSHParams(Params):
+"""
+Mixin for Locality Sensitive Hashing (LSH) algorithm parameters.
+"""
+
+numHashTables = Param(Params._dummy(), "numHashTables", "number of 
hash tables, where " +
+  "increasing number of hash tables lowers the 
false negative rate, " +
+  "and decreasing it improves the running 
performance.",
+  typeConverter=TypeConverters.toInt)
+
+def __init__(self):
+super(LSHParams, self).__init__()
+
+def setNumHashTables(self, value):
+"""
+Sets the value of :py:attr:`numHashTables`.
+"""
+return self._set(numHashTables=value)
+
+def getNumHashTables(self):
+"""
+Gets the value of numHashTables or its default value.
+"""
+return self.getOrDefault(self.numHashTables)
+
+
+class LSHModel(JavaModel):
+"""
+Mixin for Locality Sensitive Hashing (LSH) models.
+"""
+
+def approxNearestNeighbors(self, dataset, key, numNearestNeighbors, 
distCol="distCol"):
+"""
+Given a large dataset and an item, approximately find at most k 
items which have the
+closest distance to the item. If the :py:attr:`outputCol` is 
missing, the method will
+transform the data; if the :py:attr:`outputCol` exists, it will 
use that. This allows
+caching of the transformed data when necessary.
+
+.. note:: This method is experimental and will likely change 
behavior in the next release.
+
+:param dataset: The dataset to search for nearest neighbors of the 
key.
+:param key: Feature vector representing the item to search for.
+:param numNearestNeighbors: The maximum number of nearest 
neighbors.
+:param distCol: Output column for storing the distance between 
each result row and the key.
+Use "distCol" as default value if it's not 
specified.
+:return: A dataset containing at most k items closest to the key. 
A column "distCol" is
+ added to show the distance between each row and the key.
+"""
+return self._call_java("approxNearestNeighbors", dataset, key, 
numNearestNeighbors,
+   distCol)
+
+def approxSimilarityJoin(self, datasetA, datasetB, threshold, 
distCol="distCol"):
+"""
+Join two datasets to approximately find all pairs of rows whose 
distance are smaller than
+the threshold. If the :py:attr:`outputCol` is missing, the method 
will transform the data;
+if the :py:attr:`outputCol` exists, it will use that. This allows 
caching of the
+transformed data when necessary.
+
+:param datasetA: One of the datasets to join.
+:param datasetB: Another dataset to join.
+:param threshold: The threshold for the distance of row pairs.
+:param distCol: Output column for storing the distance between 
each pair of rows. Use
+"distCol" as default value if it's not specified.
+:return: A joined dataset containing pairs of rows. The original 
rows are in columns
+ "datasetA" and "datasetB", and a column "distCol" is 
added to show the distance
+ between each pair.
+"""
+return self._call_java("approxSimilarityJoin", datasetA, datasetB, 
threshold, distCol)
+
+
+@inherit_doc
+class BucketedRandomProjectionLSH(JavaEstimator, LSHParams, HasInputCol, 
HasOutputCol, HasSeed,
+  JavaMLReadable, JavaMLWritable):
+"""
+.. note:: Experimental
+
+LSH class for Euclidean distance metrics.
+The input is dense or sparse vectors, each of which represents a point 
in the Euclidean
+distance space. The output will be vectors of configurable dimension. 
Hash values in the same
+dimension are calculated by the same hash function.
+
+.. seealso:: `Stable Distributions \
+
`_
+.. seealso:: `Hashing for Similarity Search: A Survey 
`_
+
+>>> from pyspark.ml.linalg import Vectors
+>>> from pyspark.sql.functions import col
+>>> data = [(0, Vectors.dense([-1.0, -1.0 ]),),
+... (1, Vectors.dense([-1.0, 

[GitHub] spark pull request #17062: [SPARK-17495] [SQL] Support date, timestamp and i...

2017-02-27 Thread tejasapatil
Github user tejasapatil commented on a diff in the pull request:

https://github.com/apache/spark/pull/17062#discussion_r103281696
  
--- Diff: 
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/HashExpressionsSuite.scala
 ---
@@ -169,6 +171,96 @@ class HashExpressionsSuite extends SparkFunSuite with 
ExpressionEvalHelper {
 // scalastyle:on nonascii
   }
 
+  test("hive-hash for date type") {
+def checkHiveHashForDateType(dateString: String, expected: Long): Unit 
= {
+  checkHiveHash(
+DateTimeUtils.stringToDate(UTF8String.fromString(dateString)).get,
+DateType,
+expected)
+}
+
+// basic case
+checkHiveHashForDateType("2017-01-01", 17167)
--- End diff --

Expected values were computed with Hive 1.2 using:

```
SELECT HASH( CAST( "2017-01-01" AS DATE) )
```


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17062: [SPARK-17495] [SQL] Support date, timestamp and i...

2017-02-27 Thread tejasapatil
Github user tejasapatil commented on a diff in the pull request:

https://github.com/apache/spark/pull/17062#discussion_r103300013
  
--- Diff: 
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/HashExpressionsSuite.scala
 ---
@@ -169,6 +171,96 @@ class HashExpressionsSuite extends SparkFunSuite with 
ExpressionEvalHelper {
 // scalastyle:on nonascii
   }
 
+  test("hive-hash for date type") {
+def checkHiveHashForDateType(dateString: String, expected: Long): Unit 
= {
+  checkHiveHash(
+DateTimeUtils.stringToDate(UTF8String.fromString(dateString)).get,
+DateType,
+expected)
+}
+
+// basic case
+checkHiveHashForDateType("2017-01-01", 17167)
+
+// boundary cases
+checkHiveHashForDateType("-01-01", -719530)
+checkHiveHashForDateType("-12-31", 2932896)
+
+// epoch
+checkHiveHashForDateType("1970-01-01", 0)
+
+// before epoch
+checkHiveHashForDateType("1800-01-01", -62091)
+
+// Invalid input: bad date string. Hive returns 0 for such cases
--- End diff --

Spark does not allow creating a `Date` that does not fit its spec and throws an 
exception. Hive will not fail but falls back to `null` and returns `0` as the 
hash value.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17062: [SPARK-17495] [SQL] Support date, timestamp and i...

2017-02-27 Thread tejasapatil
Github user tejasapatil commented on a diff in the pull request:

https://github.com/apache/spark/pull/17062#discussion_r103357472
  
--- Diff: 
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/HashExpressionsSuite.scala
 ---
@@ -169,6 +171,96 @@ class HashExpressionsSuite extends SparkFunSuite with 
ExpressionEvalHelper {
 // scalastyle:on nonascii
   }
 
+  test("hive-hash for date type") {
+def checkHiveHashForDateType(dateString: String, expected: Long): Unit 
= {
+  checkHiveHash(
+DateTimeUtils.stringToDate(UTF8String.fromString(dateString)).get,
+DateType,
+expected)
+}
+
+// basic case
+checkHiveHashForDateType("2017-01-01", 17167)
+
+// boundary cases
+checkHiveHashForDateType("-01-01", -719530)
+checkHiveHashForDateType("-12-31", 2932896)
+
+// epoch
+checkHiveHashForDateType("1970-01-01", 0)
+
+// before epoch
+checkHiveHashForDateType("1800-01-01", -62091)
+
+// Invalid input: bad date string. Hive returns 0 for such cases
+intercept[NoSuchElementException](checkHiveHashForDateType("0-0-0", 0))
+
intercept[NoSuchElementException](checkHiveHashForDateType("-1212-01-01", 0))
+
intercept[NoSuchElementException](checkHiveHashForDateType("2016-99-99", 0))
+
+// Invalid input: Empty string. Hive returns 0 for this case
+intercept[NoSuchElementException](checkHiveHashForDateType("", 0))
+
+// Invalid input: February 30th for a leap year. Hive supports this 
but Spark doesn't
+
intercept[NoSuchElementException](checkHiveHashForDateType("2016-02-30", 16861))
+  }
+
+  test("hive-hash for timestamp type") {
+def checkHiveHashForTimestampType(
+timestamp: String,
+expected: Long,
+timeZone: TimeZone = TimeZone.getTimeZone("UTC")): Unit = {
+  checkHiveHash(
+DateTimeUtils.stringToTimestamp(UTF8String.fromString(timestamp), 
timeZone).get,
+TimestampType,
+expected)
+}
+
+// basic case
+checkHiveHashForTimestampType("2017-02-24 10:56:29", 1445725271)
+
+// with higher precision
+checkHiveHashForTimestampType("2017-02-24 10:56:29.11", 1353936655)
+
+// with different timezone
+checkHiveHashForTimestampType("2017-02-24 10:56:29", 1445732471,
+  TimeZone.getTimeZone("US/Pacific"))
+
+// boundary cases
+checkHiveHashForTimestampType("0001-01-01 00:00:00", 1645926784)
+checkHiveHashForTimestampType("-01-01 00:00:00", -1081818240)
+
+// epoch
+checkHiveHashForTimestampType("1970-01-01 00:00:00", 0)
+
+// before epoch
+checkHiveHashForTimestampType("1800-01-01 03:12:45", -267420885)
+
+// Invalid input: bad timestamp string. Hive returns 0 for such cases
+intercept[NoSuchElementException](checkHiveHashForTimestampType("0-0-0 
0:0:0", 0))
+
intercept[NoSuchElementException](checkHiveHashForTimestampType("-99-99-99 
99:99:45", 0))
+
intercept[NoSuchElementException](checkHiveHashForTimestampType("55-5-",
 0))
+
+// Invalid input: Empty string. Hive returns 0 for this case
+intercept[NoSuchElementException](checkHiveHashForTimestampType("", 0))
+
+// Invalid input: February 30th for a leap year. Hive supports this 
but Spark doesn't
+
intercept[NoSuchElementException](checkHiveHashForTimestampType("2016-02-30 
00:00:00", 0))
+
+// Invalid input: Hive accepts upto 9 decimal place precision but 
Spark uses upto 6
+
intercept[TestFailedException](checkHiveHashForTimestampType("2017-02-24 
10:56:29.", 0))
+  }
+
+  test("hive-hash for CalendarInterval type") {
+def checkHiveHashForTimestampType(interval: String, expected: Long): 
Unit = {
+  checkHiveHash(CalendarInterval.fromString(interval), 
CalendarIntervalType, expected)
+}
+
+checkHiveHashForTimestampType("interval 1 day", 3220073)
+checkHiveHashForTimestampType("interval 6 day 15 hour", 21202073)
--- End diff --

SELECT HASH ( INTERVAL '1' DAY + INTERVAL '15' HOUR );


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17062: [SPARK-17495] [SQL] Support date, timestamp and i...

2017-02-27 Thread tejasapatil
Github user tejasapatil commented on a diff in the pull request:

https://github.com/apache/spark/pull/17062#discussion_r103300293
  
--- Diff: 
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/HashExpressionsSuite.scala
 ---
@@ -169,6 +171,96 @@ class HashExpressionsSuite extends SparkFunSuite with 
ExpressionEvalHelper {
 // scalastyle:on nonascii
   }
 
+  test("hive-hash for date type") {
+def checkHiveHashForDateType(dateString: String, expected: Long): Unit 
= {
+  checkHiveHash(
+DateTimeUtils.stringToDate(UTF8String.fromString(dateString)).get,
+DateType,
+expected)
+}
+
+// basic case
+checkHiveHashForDateType("2017-01-01", 17167)
+
+// boundary cases
+checkHiveHashForDateType("-01-01", -719530)
+checkHiveHashForDateType("-12-31", 2932896)
+
+// epoch
+checkHiveHashForDateType("1970-01-01", 0)
+
+// before epoch
+checkHiveHashForDateType("1800-01-01", -62091)
+
+// Invalid input: bad date string. Hive returns 0 for such cases
+intercept[NoSuchElementException](checkHiveHashForDateType("0-0-0", 0))
+
intercept[NoSuchElementException](checkHiveHashForDateType("-1212-01-01", 0))
+
intercept[NoSuchElementException](checkHiveHashForDateType("2016-99-99", 0))
+
+// Invalid input: Empty string. Hive returns 0 for this case
+intercept[NoSuchElementException](checkHiveHashForDateType("", 0))
+
+// Invalid input: February 30th for a leap year. Hive supports this 
but Spark doesn't
+
intercept[NoSuchElementException](checkHiveHashForDateType("2016-02-30", 16861))
+  }
+
+  test("hive-hash for timestamp type") {
+def checkHiveHashForTimestampType(
+timestamp: String,
+expected: Long,
+timeZone: TimeZone = TimeZone.getTimeZone("UTC")): Unit = {
+  checkHiveHash(
+DateTimeUtils.stringToTimestamp(UTF8String.fromString(timestamp), 
timeZone).get,
+TimestampType,
+expected)
+}
+
+// basic case
+checkHiveHashForTimestampType("2017-02-24 10:56:29", 1445725271)
--- End diff --

Corresponding Hive query:
```
select HASH(CAST("2017-02-24 10:56:29" AS TIMESTAMP));
```

Note that this is with the system timezone set to UTC (`export TZ=/usr/share/zoneinfo/UTC`).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #17062: [SPARK-17495] [SQL] Support date, timestamp and interval...

2017-02-27 Thread tejasapatil
Github user tejasapatil commented on the issue:

https://github.com/apache/spark/pull/17062
  
@gatorsmile : can you please review this PR ?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17012: [SPARK-19677][SS] Renaming a file atop an existin...

2017-02-27 Thread zsxwing
Github user zsxwing commented on a diff in the pull request:

https://github.com/apache/spark/pull/17012#discussion_r103359940
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/HDFSBackedStateStoreProvider.scala
 ---
@@ -274,7 +274,9 @@ private[state] class HDFSBackedStateStoreProvider(
   private def commitUpdates(newVersion: Long, map: MapType, tempDeltaFile: 
Path): Path = {
 synchronized {
   val finalDeltaFile = deltaFile(newVersion)
-  if (!fs.rename(tempDeltaFile, finalDeltaFile)) {
+  // Renaming a file atop an existing one fails on HDFS, see
--- End diff --

Could you add our discussion to the comment? Such as
```
  // scalastyle:off
  // Renaming a file atop an existing one fails on HDFS
  // 
(http://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/filesystem/filesystem.html).
  // Hence we should either skip the rename step or delete the target 
file. Because deleting the
  // target file will beak speculation, skipping the rename step is the 
only choice. It's still
  // semantically correct because Structured Streaming requires 
rerunning a batch should
  // generate the same output. (SPARK-19677)
  // scalastyle:on
```
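
For context, a rough sketch of the commit logic that comment describes (simplified, with my own signature and names; the real code lives in `HDFSBackedStateStoreProvider.commitUpdates`):

```
import java.io.IOException
import org.apache.hadoop.fs.{FileSystem, Path}

// Skip the rename when the final delta file already exists: deleting the target would break
// speculation, and rerunning a batch must produce the same output anyway (SPARK-19677).
def commitDelta(fs: FileSystem, tempDeltaFile: Path, finalDeltaFile: Path): Path = {
  if (fs.exists(finalDeltaFile)) {
    // An earlier attempt already committed this version; keep its output and drop ours.
    fs.delete(tempDeltaFile, false)
  } else if (!fs.rename(tempDeltaFile, finalDeltaFile)) {
    throw new IOException(s"Failed to rename $tempDeltaFile to $finalDeltaFile")
  }
  finalDeltaFile
}
```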


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17082: [SPARK-19749][SS] Name socket source with a meani...

2017-02-27 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/17082


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16954: [SPARK-18874][SQL] First phase: Deferring the cor...

2017-02-27 Thread dilipbiswal
Github user dilipbiswal commented on a diff in the pull request:

https://github.com/apache/spark/pull/16954#discussion_r103362319
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/predicates.scala
 ---
@@ -123,19 +123,36 @@ case class Not(child: Expression)
  */
 @ExpressionDescription(
   usage = "expr1 _FUNC_(expr2, expr3, ...) - Returns true if `expr` equals 
to any valN.")
-case class In(value: Expression, list: Seq[Expression]) extends Predicate
-with ImplicitCastInputTypes {
+case class In(value: Expression, list: Seq[Expression]) extends Predicate {
 
   require(list != null, "list should not be null")
-
-  override def inputTypes: Seq[AbstractDataType] = value.dataType +: 
list.map(_.dataType)
-
   override def checkInputDataTypes(): TypeCheckResult = {
-if (list.exists(l => l.dataType != value.dataType)) {
-  TypeCheckResult.TypeCheckFailure(
-"Arguments must be same type")
-} else {
-  TypeCheckResult.TypeCheckSuccess
+list match {
+  case ListQuery(sub, _, _) :: Nil =>
--- End diff --

@hvanhovell Actually, previously checkInputDataTypes() called from 
checkAnalysis never had to deal with **IN subquery expressions**, as they got 
rewritten to PredicateSubquery. With the change in this PR we now see the 
original IN subquery expressions, and thus we need to make sure the types are 
the same between the LHS and RHS of the IN subquery expression.
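
Roughly, the shape of the check is something like the following (a simplified sketch of the `In.checkInputDataTypes` override, relying on the `value` and `list` members shown in the diff; not the PR's exact code):

```
// Simplified: when the IN list is a single ListQuery, compare the value's type against the
// subquery's output type; otherwise fall back to the old same-type check on the literal list.
override def checkInputDataTypes(): TypeCheckResult = list match {
  case ListQuery(sub, _, _) :: Nil =>
    if (sub.output.length != 1) {
      TypeCheckResult.TypeCheckFailure("The subquery must return exactly one column")
    } else if (sub.output.head.dataType != value.dataType) {
      TypeCheckResult.TypeCheckFailure(
        s"LHS (${value.dataType}) and RHS (${sub.output.head.dataType}) of IN must be the same type")
    } else {
      TypeCheckResult.TypeCheckSuccess
    }
  case _ if list.exists(_.dataType != value.dataType) =>
    TypeCheckResult.TypeCheckFailure("Arguments must be same type")
  case _ =>
    TypeCheckResult.TypeCheckSuccess
}
```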


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16917: [SPARK-19529][BRANCH-1.6] Backport PR #16866 to branch-1...

2017-02-27 Thread yhuai
Github user yhuai commented on the issue:

https://github.com/apache/spark/pull/16917
  
Let's use a meaningful title in the future :)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17093: [SPARK-19761][SQL]create InMemoryFileIndex with a...

2017-02-27 Thread windpiger
Github user windpiger commented on a diff in the pull request:

https://github.com/apache/spark/pull/17093#discussion_r103377566
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/FileIndexSuite.scala
 ---
@@ -178,6 +179,12 @@ class FileIndexSuite extends SharedSQLContext {
   assert(catalog2.allFiles().nonEmpty)
 }
   }
+
+  test("InMemoryFileIndex with empty rootPaths when 
PARALLEL_PARTITION_DISCOVERY_THRESHOLD is 0") {
--- End diff --

I think users would not intentionally set it to a negative number, but 
it is better if we can cover these situations.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #13036: [SPARK-15243][ML][SQL][PYSPARK] Param methods sho...

2017-02-27 Thread sethah
Github user sethah closed the pull request at:

https://github.com/apache/spark/pull/13036


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #13036: [SPARK-15243][ML][SQL][PYSPARK] Param methods should use...

2017-02-27 Thread sethah
Github user sethah commented on the issue:

https://github.com/apache/spark/pull/13036
  
@holdenk please feel free to take this over. Can't find time to work on it


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16954: [SPARK-18874][SQL] First phase: Deferring the cor...

2017-02-27 Thread dilipbiswal
Github user dilipbiswal commented on a diff in the pull request:

https://github.com/apache/spark/pull/16954#discussion_r103354299
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/subquery.scala
 ---
@@ -40,19 +42,179 @@ abstract class PlanExpression[T <: QueryPlan[_]] 
extends Expression {
 /**
  * A base interface for expressions that contain a [[LogicalPlan]].
  */
-abstract class SubqueryExpression extends PlanExpression[LogicalPlan] {
+abstract class SubqueryExpression(
+plan: LogicalPlan,
+children: Seq[Expression],
+exprId: ExprId) extends PlanExpression[LogicalPlan] {
+
+  override lazy val resolved: Boolean = childrenResolved && plan.resolved
+  override lazy val references: AttributeSet =
+if (plan.resolved) super.references -- plan.outputSet else 
super.references
   override def withNewPlan(plan: LogicalPlan): SubqueryExpression
+  override def semanticEquals(o: Expression): Boolean = o match {
+case p: SubqueryExpression =>
+  this.getClass.getName.equals(p.getClass.getName) && 
plan.sameResult(p.plan) &&
+children.length == p.children.length &&
+children.zip(p.children).forall(p => p._1.semanticEquals(p._2))
+case _ => false
+  }
 }
 
 object SubqueryExpression {
+  /**
+   * Returns true when an expression contains an IN or EXISTS subquery and 
false otherwise.
+   */
+  def hasInOrExistsSubquery(e: Expression): Boolean = {
+e.find {
+  case _: ListQuery | _: Exists => true
+  case _ => false
+}.isDefined
+  }
+
+  /**
+   * Returns true when an expression contains a subquery that has outer 
reference(s). The outer
+   * reference attributes are kept as children of subquery expression by
+   * [[org.apache.spark.sql.catalyst.analysis.Analyzer.ResolveSubquery]]
+   */
   def hasCorrelatedSubquery(e: Expression): Boolean = {
 e.find {
-  case e: SubqueryExpression if e.children.nonEmpty => true
+  case s: SubqueryExpression if s.children.nonEmpty => true
   case _ => false
 }.isDefined
   }
 }
 
+object SubExprUtils extends PredicateHelper {
+  /**
+   * Returns true when an expression contains correlated predicates i.e 
outer references and
+   * returns false otherwise.
+   */
+  def containsOuter(e: Expression): Boolean = {
+e.find(_.isInstanceOf[OuterReference]).isDefined
+  }
+
+  /**
+   * Returns whether there are any null-aware predicate subqueries inside 
Not. If not, we could
+   * turn the null-aware predicate into not-null-aware predicate.
+   */
+  def hasNullAwarePredicateWithinNot(e: Expression): Boolean = {
+e.find{ x =>
+  x.isInstanceOf[Not] && e.find {
+case In(_, Seq(_: ListQuery)) => true
+case _ => false
+  }.isDefined
+}.isDefined
+  }
+
+  /**
+   * Returns an expression after removing the OuterReference shell.
+   */
+  def stripOuterReference(e: Expression): Expression = {
+e.transform {
+  case OuterReference(r) => r
+}
+  }
+
+  /**
+   * Returns the list of expressions after removing the OuterReference 
shell from each of
+   * the expression.
+   */
+  def stripOuterReferences(e: Seq[Expression]): Seq[Expression] = 
e.map(stripOuterReference)
+
+  /**
+   * Returns the logical plan after removing the OuterReference shell from 
all the expressions
+   * of the input logical plan.
+   */
+  def stripOuterReferences(p: LogicalPlan): LogicalPlan = {
+p.transformAllExpressions {
+  case OuterReference(a) => a
+}
+  }
+
+  /**
+   * Given a list of expressions, returns the expressions which have outer 
references. Aggregate
+   * expressions are treated in a special way. If the children of 
aggregate expression contains an
+   * outer reference, then the entire aggregate expression is marked as an 
outer reference.
+   * Example (SQL):
+   * {{{
+   *   SELECT a FROM l GROUP by 1 HAVING EXISTS (SELECT 1 FROM r WHERE d < 
min(b))
+   * }}}
+   * In the above case, we want to mark the entire min(b) as an outer 
reference
+   * OuterReference(min(b)) instead of min(OuterReference(b)).
+   * TODO: Currently we don't allow deep correlation. Also, we don't allow 
mixing of
+   * outer references and local references under an aggregate expression.
+   * For example (SQL):
+   * {{{
+   *   SELECT .. FROM p1
+   *   WHERE EXISTS (SELECT ...
+   * FROM p2
+   * WHERE EXISTS (SELECT ...
+   *   FROM sq
+   *   WHERE min(p1.a + p2.b) = sq.c))
+   *
+   *  

[GitHub] spark pull request #14273: [SPARK-9140] [ML] Replace TimeTracker by MultiSto...

2017-02-27 Thread MechCoder
Github user MechCoder closed the pull request at:

https://github.com/apache/spark/pull/14273


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16959: [SPARK-19631][CORE] OutputCommitCoordinator shoul...

2017-02-27 Thread kayousterhout
Github user kayousterhout commented on a diff in the pull request:

https://github.com/apache/spark/pull/16959#discussion_r103354382
  
--- Diff: 
core/src/main/scala/org/apache/spark/scheduler/OutputCommitCoordinator.scala ---
@@ -111,13 +115,13 @@ private[spark] class OutputCommitCoordinator(conf: 
SparkConf, isDriver: Boolean)
 val arr = new Array[TaskAttemptNumber](maxPartitionId + 1)
 java.util.Arrays.fill(arr, NO_AUTHORIZED_COMMITTER)
 synchronized {
-  authorizedCommittersByStage(stage) = arr
+  stageStates(stage) = new StageState(arr)
--- End diff --

Ah, sorry: now that I see this I realize it probably makes sense to 
initialize `arr` in the `StageState` constructor too (so this line would look 
like `new StageState(maxPartitionId + 1)`, and the `StageState` constructor just 
takes `numPartitions`). Would you mind making that change too?
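
For what it's worth, a sketch of what that implies (type and constant names taken from the surrounding diff; not necessarily the final shape):

```
// StageState builds the authorized-committers array itself from the partition count.
class StageState(numPartitions: Int) {
  val authorizedCommitters: Array[TaskAttemptNumber] =
    Array.fill[TaskAttemptNumber](numPartitions)(NO_AUTHORIZED_COMMITTER)
}

// ... and the call site shown above becomes:
// stageStates(stage) = new StageState(maxPartitionId + 1)
```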


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #17012: [SPARK-19677][SS] Renaming a file atop an existing one s...

2017-02-27 Thread zsxwing
Github user zsxwing commented on the issue:

https://github.com/apache/spark/pull/17012
  
ok to test


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17092: [SPARK-18450][ML] Scala API Change for LSH AND-am...

2017-02-27 Thread Yunni
GitHub user Yunni opened a pull request:

https://github.com/apache/spark/pull/17092

[SPARK-18450][ML] Scala API Change for LSH AND-amplification

## What changes were proposed in this pull request?
Implemented a new Param, numHashFunctions, as the dimension of 
AND-amplification for Locality Sensitive Hashing. The hash of each feature 
in LSH is now an array of size numHashTables, where each element of the array 
is a vector of size numHashFunctions.

Two features are in the same hash bucket iff ANY pair of their corresponding 
vectors is equal (OR-amplification). Two vectors are equal iff ALL pairs of 
their entries are equal (AND-amplification).
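
In other words, the collision test amounts to something like the following (illustrative helper only, not part of the proposed API):

```
import org.apache.spark.ml.linalg.Vector

// OR over hash tables, AND over the hash functions within a table.
def sameBucket(hashesA: Seq[Vector], hashesB: Seq[Vector]): Boolean =
  hashesA.zip(hashesB).exists { case (a, b) =>   // ANY table may match (OR-amplification)
    a.toArray.sameElements(b.toArray)            // ALL entries in it must match (AND-amplification)
  }
```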

Will create follow-up PRs for Python API and Doc/Examples.

## How was this patch tested?
By running unit tests MinHashLSHSuite and BucketedRandomProjectionLSHSuite.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/Yunni/spark SPARK-18450

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/17092.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #17092


commit e6f9f9541f0b00c14b7c5a201b22aeb400eb9f19
Author: Yun Ni 
Date:   2017-02-16T20:54:22Z

Scala API Change for AND-amplification

commit 010acb2caf69ca0822db6aeb866cce21cdfcce4b
Author: Yunni 
Date:   2017-02-27T03:43:21Z

Merge branch 'SPARK-18450' of https://github.com/Yunni/spark into 
SPARK-18450

commit 83a155699df4b15f1ab1fc427730613b63f7d1d6
Author: Yunni 
Date:   2017-02-27T04:04:37Z

Fix typos in unit tests

commit 9dd87ba21a025939df7020ff1491a2c6c29f2d93
Author: Yunni 
Date:   2017-02-28T02:04:10Z

Merge branch 'master' of https://github.com/apache/spark into SPARK-18450




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17012: [SPARK-19677][SS] Renaming a file atop an existin...

2017-02-27 Thread zsxwing
Github user zsxwing commented on a diff in the pull request:

https://github.com/apache/spark/pull/17012#discussion_r103361529
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/streaming/state/StateStoreSuite.scala
 ---
@@ -682,6 +684,21 @@ private[state] object StateStoreSuite {
 }
 
 /**
+  * Fake FileSystem that simulates HDFS rename semantic, i.e. renaming a 
file atop an existing
--- End diff --

nit: indent


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16715: [Spark-18080][ML][PYTHON] Python API & Examples f...

2017-02-27 Thread Yunni
Github user Yunni commented on a diff in the pull request:

https://github.com/apache/spark/pull/16715#discussion_r103361528
  
--- Diff: python/pyspark/ml/feature.py ---
@@ -120,6 +122,196 @@ def getThreshold(self):
 return self.getOrDefault(self.threshold)
 
 
+class LSHParams(Params):
+"""
+Mixin for Locality Sensitive Hashing (LSH) algorithm parameters.
+"""
+
+numHashTables = Param(Params._dummy(), "numHashTables", "number of 
hash tables, where " +
+  "increasing number of hash tables lowers the 
false negative rate, " +
+  "and decreasing it improves the running 
performance.",
+  typeConverter=TypeConverters.toInt)
+
+def __init__(self):
+super(LSHParams, self).__init__()
+
+def setNumHashTables(self, value):
+"""
+Sets the value of :py:attr:`numHashTables`.
+"""
+return self._set(numHashTables=value)
+
+def getNumHashTables(self):
+"""
+Gets the value of numHashTables or its default value.
+"""
+return self.getOrDefault(self.numHashTables)
+
+
+class LSHModel(JavaModel):
+"""
+Mixin for Locality Sensitive Hashing (LSH) models.
+"""
+
+def approxNearestNeighbors(self, dataset, key, numNearestNeighbors, 
distCol="distCol"):
+"""
+Given a large dataset and an item, approximately find at most k 
items which have the
+closest distance to the item. If the :py:attr:`outputCol` is 
missing, the method will
+transform the data; if the :py:attr:`outputCol` exists, it will 
use that. This allows
+caching of the transformed data when necessary.
+
+.. note:: This method is experimental and will likely change 
behavior in the next release.
+
+:param dataset: The dataset to search for nearest neighbors of the 
key.
+:param key: Feature vector representing the item to search for.
+:param numNearestNeighbors: The maximum number of nearest 
neighbors.
+:param distCol: Output column for storing the distance between 
each result row and the key.
+Use "distCol" as default value if it's not 
specified.
+:return: A dataset containing at most k items closest to the key. 
A column "distCol" is
+ added to show the distance between each row and the key.
+"""
+return self._call_java("approxNearestNeighbors", dataset, key, 
numNearestNeighbors,
+   distCol)
+
+def approxSimilarityJoin(self, datasetA, datasetB, threshold, 
distCol="distCol"):
+"""
+Join two datasets to approximately find all pairs of rows whose 
distance are smaller than
+the threshold. If the :py:attr:`outputCol` is missing, the method 
will transform the data;
+if the :py:attr:`outputCol` exists, it will use that. This allows 
caching of the
+transformed data when necessary.
+
+:param datasetA: One of the datasets to join.
+:param datasetB: Another dataset to join.
+:param threshold: The threshold for the distance of row pairs.
+:param distCol: Output column for storing the distance between 
each pair of rows. Use
+"distCol" as default value if it's not specified.
+:return: A joined dataset containing pairs of rows. The original 
rows are in columns
+ "datasetA" and "datasetB", and a column "distCol" is 
added to show the distance
+ between each pair.
+"""
+return self._call_java("approxSimilarityJoin", datasetA, datasetB, 
threshold, distCol)
+
+
+@inherit_doc
+class BucketedRandomProjectionLSH(JavaEstimator, LSHParams, HasInputCol, 
HasOutputCol, HasSeed,
+  JavaMLReadable, JavaMLWritable):
+"""
+.. note:: Experimental
+
+LSH class for Euclidean distance metrics.
+The input is dense or sparse vectors, each of which represents a point 
in the Euclidean
+distance space. The output will be vectors of configurable dimension. 
Hash values in the same
+dimension are calculated by the same hash function.
+
+.. seealso:: `Stable Distributions \
+
`_
+.. seealso:: `Hashing for Similarity Search: A Survey 
`_
+
+>>> from pyspark.ml.linalg import Vectors
+>>> from pyspark.sql.functions import col
+>>> data = [(0, Vectors.dense([-1.0, -1.0 ]),),
+... (1, Vectors.dense([-1.0, 1.0 

[GitHub] spark issue #17092: [SPARK-18450][ML] Scala API Change for LSH AND-amplifica...

2017-02-27 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17092
  
**[Test build #73550 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73550/testReport)**
 for PR 17092 at commit 
[`9dd87ba`](https://github.com/apache/spark/commit/9dd87ba21a025939df7020ff1491a2c6c29f2d93).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #17093: [SPARK-19761][SQL]create InMemoryFileIndex with an empty...

2017-02-27 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17093
  
**[Test build #73552 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73552/testReport)**
 for PR 17093 at commit 
[`96898a2`](https://github.com/apache/spark/commit/96898a2332c64b101efc54d1ccbbf29102b88e68).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #17093: [SPARK-19761][SQL]create InMemoryFileIndex with an empty...

2017-02-27 Thread windpiger
Github user windpiger commented on the issue:

https://github.com/apache/spark/pull/17093
  
cc @cloud-fan @gatorsmile 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17090: [Spark-19535][ML] RecommendForAllUsers RecommendF...

2017-02-27 Thread sueann
Github user sueann commented on a diff in the pull request:

https://github.com/apache/spark/pull/17090#discussion_r103366357
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala 
---
@@ -285,6 +285,43 @@ class ALSModel private[ml] (
 
   @Since("1.6.0")
   override def write: MLWriter = new ALSModel.ALSModelWriter(this)
+
+  @Since("2.2.0")
+  def recommendForAllUsers(num: Int): DataFrame = {
+    recommendForAll(userFactors, itemFactors, $(userCol), num)
+  }
+
+  @Since("2.2.0")
+  def recommendForAllItems(num: Int): DataFrame = {
+    recommendForAll(itemFactors, userFactors, $(itemCol), num)
+  }
+
+  /**
+   * Makes recommendations for all users (or items).
+   * @param srcFactors src factors for which to generate recommendations
+   * @param dstFactors dst factors used to make recommendations
+   * @param srcOutputColumn name of the column for the source in the output DataFrame
+   * @param num number of recommendations for each record
+   * @return a DataFrame of (srcOutputColumn: Int, recommendations), where recommendations are
+   *         stored as an array of (dstId: Int, rating: Double) tuples.
+   */
+  private def recommendForAll(
+      srcFactors: DataFrame,
+      dstFactors: DataFrame,
+      srcOutputColumn: String,
+      num: Int): DataFrame = {
+    import srcFactors.sparkSession.implicits._
+
+    val ratings = srcFactors.crossJoin(dstFactors)
+      .select(
+        srcFactors("id").as("srcId"),
+        dstFactors("id").as("dstId"),
+        predict(srcFactors("features"), dstFactors("features")).as($(predictionCol)))
+    // We'll force the IDs to be Int. Unfortunately this converts IDs to Int in the output.
+    val topKAggregator = new TopByKeyAggregator[Int, Int, Float](num, Ordering.by(_._2))
+    ratings.as[(Int, Int, Float)].groupByKey(_._1).agg(topKAggregator.toColumn)
--- End diff --

I'm not sure what a good way to do this is :-/ Ways I can think of but 
haven't succeeded in:
1/ change the schema of the entire DataFrame
2/ map over the rows in the DataFrame {
  map over the items in the array {
convert from tuple (really a Row) to a Row with a different schema
  }
}

I tried using RowEncoder in either case, but the types haven't quite worked 
out. Any ideas?
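
One possible direction, sketched here only as an illustration and not as the PR's code, assuming the aggregation yields an `array<struct<_1:int,_2:float>>` column: cast the array column to an explicit struct schema so the tuple fields get the desired names.

```
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types._

object RenameRecsSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[2]").appName("rename-recs").getOrCreate()
    import spark.implicits._

    // Stand-in for the aggregator output: srcId -> top recommendations as (dstId, rating) tuples.
    val aggregated = Seq((0, Seq((10, 0.9f), (11, 0.7f)))).toDF("srcId", "recommendations")

    // Target element type with the field names we actually want in the output.
    val recSchema = ArrayType(
      new StructType()
        .add("dstId", IntegerType)
        .add("rating", FloatType))

    // Casting between structs with compatible field types effectively renames the fields,
    // so the column reads back as array<struct<dstId:int,rating:float>>.
    val renamed = aggregated.withColumn("recommendations", col("recommendations").cast(recSchema))
    renamed.printSchema()

    spark.stop()
  }
}
```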


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #17090: [Spark-19535][ML] RecommendForAllUsers RecommendForAllIt...

2017-02-27 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17090
  
**[Test build #73553 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73553/testReport)**
 for PR 17090 at commit 
[`ebd2604`](https://github.com/apache/spark/commit/ebd26043fc9432d41b83612dfefcc27229d318cb).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17094: [SPARK-19762][ML] Hierarchy for consolidating ML ...

2017-02-27 Thread sethah
GitHub user sethah opened a pull request:

https://github.com/apache/spark/pull/17094

[SPARK-19762][ML] Hierarchy for consolidating ML aggregator/loss code

## What changes were proposed in this pull request?

JIRA: [SPARK-19762](https://issues.apache.org/jira/browse/SPARK-19762)

This patch is a WIP. 

The larger changes in this patch are:

* Adds a `DifferentiableLossAggregator` trait which is intended to be used 
as a common parent trait to all Spark ML aggregator classes. It factors out the 
common methods: `merge, gradient, loss, weight` from the aggregator subclasses.
* Adds an `RDDLossFunction` which is intended to be the only implementation of Breeze's `DiffFunction` necessary in Spark ML, and can be used by all other algorithms. It takes the aggregator type as a type parameter and maps the aggregator over an RDD. It additionally takes an optional regularization loss function for applying the differentiable part of regularization.
* Factors out the regularization from the data part of the cost function, 
and treats regularization as a separate independent cost function which can be 
evaluated and added to the data cost function.
* Changes `LinearRegression` to use this new hierarchy as a proof of 
concept.
* Adds the following new namespaces `o.a.s.ml.optim.loss` and 
`o.a.s.ml.optim.aggregator`

**NOTE: The large majority of the "lines added" and "lines deleted" are 
simply code moving around or unit tests.**
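
A minimal, self-contained sketch of the aggregator shape described above (plain Scala arrays stand in for ML vectors, and the signatures are assumptions for illustration, not the final API):

```
trait DifferentiableLossAggregator[Datum, Agg <: DifferentiableLossAggregator[Datum, Agg]] {
  self: Agg =>

  protected val dim: Int
  protected var weightSum: Double = 0.0
  protected var lossSum: Double = 0.0
  protected lazy val gradientSumArray: Array[Double] = Array.ofDim[Double](dim)

  /** Add a single datum to the running sums. */
  def add(instance: Datum): Agg

  /** Merge another aggregator of the same concrete type, e.g. one built on another partition. */
  def merge(other: Agg): Agg = {
    if (other.weightSum != 0.0) {
      weightSum += other.weightSum
      lossSum += other.lossSum
      var i = 0
      while (i < dim) { gradientSumArray(i) += other.gradientSumArray(i); i += 1 }
    }
    this
  }

  def weight: Double = weightSum
  def loss: Double = lossSum / weightSum
  def gradient: Array[Double] = gradientSumArray.map(_ / weightSum)
}
```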

BTW, I also converted LinearSVC to this framework as a way to prove that 
this new hierarchy is flexible enough for the other algorithms, but I backed 
those changes out because the PR is large enough as is. 

## How was this patch tested?
Test suites are added for the new components, and some test suites are also 
added to provide coverage where there wasn't any before.

* DifferentiableLossAggregatorSuite
* LeastSquaresAggregatorSuite
* RDDLossFunctionSuite
* DifferentiableRegularizationSuite

I would additionally like to run some performance/scale tests with linear 
regression to ensure that there are no regressions. This patch is WIP until I 
can complete the tests. Since the design will likely have some iteration, I'd 
like to have it open for review before the scale tests are done.

## Follow ups

If this design is accepted, we will convert the other ML algorithms that 
use this aggregator pattern to this new hierarchy in follow up PRs. 


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/sethah/spark ml_aggregators

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/17094.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #17094


commit d6fae000d95284598e41d8bf95eb7067d8970e69
Author: sethah 
Date:   2017-02-27T19:03:03Z

consolidate ml aggregators

commit 86b56001a82f43fe1342bb1c26c6edcce6523865
Author: sethah 
Date:   2017-02-27T20:29:14Z

curried constructors

commit 06e547bdfb38d3b428a4a48c681aea989a11d625
Author: sethah 
Date:   2017-02-27T21:06:59Z

self types and docs

commit c930ced63b5c1faebe8063c1bf90a26cf9fae2be
Author: sethah 
Date:   2017-02-27T22:25:27Z

aggregator test suite

commit 6a596f23c855b2da0d9ba9133dee2f311dceb615
Author: sethah 
Date:   2017-02-27T23:03:16Z

loss function suite

commit 4b36119652173fff30c5869694015e1519753a05
Author: sethah 
Date:   2017-02-27T23:50:24Z

ls agg tests

commit ac55f06238cc9043ac2eaf282c3f8513a1a97076
Author: sethah 
Date:   2017-02-28T00:37:16Z

all tests passing, still need tests for regularization

commit ab5151ea41cde7d898bd65b998f674da3a5975ea
Author: sethah 
Date:   2017-02-28T01:07:59Z

regularization suite

commit 0366a8eefcef39c3251c9a7050944ada03bb4f47
Author: sethah 
Date:   2017-02-28T01:14:50Z

backing out svc changes

commit 28b88e48027959e0574c9d13236daff44fcdf650
Author: sethah 
Date:   2017-02-28T01:50:56Z

style cleanups and documentation

commit 9a04d0bc51bed29bca28a5e34ebc5b614b6560d2
Author: sethah 
Date:   2017-02-28T03:15:11Z

tolerances and imports




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark issue #17088: [SPARK-19753][CORE] All shuffle files on a host should b...

2017-02-27 Thread sitalkedia
Github user sitalkedia commented on the issue:

https://github.com/apache/spark/pull/17088
  
>> This is quite drastic for a fetch failure : spark already has mechanisms 
in place to detect executor/host failure - which take care of these failure 
modes.

Unfortunately, the mechanisms already in place are not sufficient. Let's imagine a situation where the shuffle service becomes unresponsive or OOMs: in that case we will not see any host failure, yet the driver will still receive fetch failures. The current model assumes that all shuffle output for an executor is lost; however, since the shuffle service serves all the shuffle files on that host, we should mark all the shuffle files on that host as unavailable. 
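
A toy, self-contained sketch of the bookkeeping change being argued for (the names are illustrative, not Spark's actual API): track map outputs per (host, executor) and invalidate them per host when the external shuffle service on that host is the thing failing.

```
final case class MapOutput(host: String, executorId: String, shuffleId: Int, mapId: Int)

class ToyOutputTracker {
  private var outputs = Set.empty[MapOutput]

  def register(o: MapOutput): Unit = outputs += o

  // Current behaviour: only the failing executor's outputs are invalidated.
  def removeOutputsOnExecutor(executorId: String): Unit =
    outputs = outputs.filterNot(_.executorId == executorId)

  // Proposed behaviour: every output served from the host is invalidated, since a single
  // external shuffle service serves all executors on that host.
  def removeOutputsOnHost(host: String): Unit =
    outputs = outputs.filterNot(_.host == host)
}
```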


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16959: [SPARK-19631][CORE] OutputCommitCoordinator should not a...

2017-02-27 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16959
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/73544/
Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16959: [SPARK-19631][CORE] OutputCommitCoordinator should not a...

2017-02-27 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16959
  
Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #17094: [SPARK-19762][ML][WIP] Hierarchy for consolidating ML ag...

2017-02-27 Thread sethah
Github user sethah commented on the issue:

https://github.com/apache/spark/pull/17094
  
Jenkins test this please.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17095: [SPARK-19763][SQL]qualified external datasource t...

2017-02-27 Thread windpiger
GitHub user windpiger opened a pull request:

https://github.com/apache/spark/pull/17095

[SPARK-19763][SQL]qualified external datasource table location stored in 
catalog

## What changes were proposed in this pull request?

If we create an external datasource table with a non-qualified location, we should qualify it before storing it in the catalog.

```
CREATE TABLE t(a string)
USING parquet
LOCATION '/path/xx'


CREATE TABLE t1(a string, b string)
USING parquet
PARTITIONED BY(b)
LOCATION '/path/xx'
```

When we read the table back from the catalog, the location should be qualified, e.g. 'file:/path/xxx'.
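
For illustration only (this helper is an assumption, not the PR's code), qualifying a raw location with Hadoop's FileSystem API could look like:

```
import java.net.URI

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path

// Resolve scheme and authority for a raw location, e.g. '/path/xx' -> 'file:/path/xx'.
def qualifyLocation(location: String, hadoopConf: Configuration): URI = {
  val path = new Path(location)
  val fs = path.getFileSystem(hadoopConf)
  fs.makeQualified(path).toUri
}
```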
## How was this patch tested?
unit test added

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/windpiger/spark tablepathQualified

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/17095.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #17095


commit 570ce24bee80dad5b2e897db34d04f3752139555
Author: windpiger 
Date:   2017-02-28T03:49:55Z

[SPARK-19763][SQL]qualified external datasource table location stored in 
catalog




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17015: [SPARK-19678][SQL] remove MetastoreRelation

2017-02-27 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/17015#discussion_r103383279
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/interface.scala
 ---
@@ -349,36 +350,41 @@ object CatalogTypes {
 
 
 /**
- * An interface that is implemented by logical plans to return the underlying catalog table.
- * If we can in the future consolidate SimpleCatalogRelation and MetastoreRelation, we should
- * probably remove this interface.
+ * A [[LogicalPlan]] that represents a table.
  */
-trait CatalogRelation {
-  def catalogTable: CatalogTable
-  def output: Seq[Attribute]
-}
+case class CatalogRelation(
+    tableMeta: CatalogTable,
+    dataCols: Seq[Attribute],
+    partitionCols: Seq[Attribute]) extends LeafNode with MultiInstanceRelation {
+  assert(tableMeta.identifier.database.isDefined)
+  assert(tableMeta.partitionSchema.sameType(partitionCols.toStructType))
+  assert(tableMeta.dataSchema.sameType(dataCols.toStructType))
+
+  // The partition column should always appear after data columns.
+  override def output: Seq[Attribute] = dataCols ++ partitionCols
+
+  def isPartitioned: Boolean = partitionCols.nonEmpty
+
+  override def equals(relation: Any): Boolean = relation match {
+    case other: CatalogRelation => tableMeta == other.tableMeta && output == other.output
+    case _ => false
+  }
 
+  override def hashCode(): Int = {
+    Objects.hashCode(tableMeta.identifier, output)
+  }
 
-/**
- * A [[LogicalPlan]] that wraps [[CatalogTable]].
- *
- * Note that in the future we should consolidate this and HiveCatalogRelation.
- */
-case class SimpleCatalogRelation(
-    metadata: CatalogTable)
-  extends LeafNode with CatalogRelation {
-
-  override def catalogTable: CatalogTable = metadata
-
-  override lazy val resolved: Boolean = false
-
-  override val output: Seq[Attribute] = {
-    val (partCols, dataCols) = metadata.schema.toAttributes
-      // Since data can be dumped in randomly with no validation, everything is nullable.
-      .map(_.withNullability(true).withQualifier(Some(metadata.identifier.table)))
-      .partition { a =>
-        metadata.partitionColumnNames.contains(a.name)
-      }
-    dataCols ++ partCols
+  /** Only compare table identifier. */
+  override lazy val cleanArgs: Seq[Any] = Seq(tableMeta.identifier)
+
+  override def computeStats(conf: CatalystConf): Statistics = {
+    // For data source tables, we will create a `LogicalRelation` and won't call this method, for
+    // hive serde tables, we will always generate a statistics.
+    // TODO: unify the table stats generation.
+    tableMeta.stats.map(_.toPlanStats(output)).get
--- End diff --

Yeah, the value should always be filled in by `DetermineTableStats`, but maybe we can still throw an exception when it is `None`?
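
A standalone illustration of that guard (the names and message are placeholders, not the PR's code):

```
def planStatsOrFail(stats: Option[BigInt], table: String): BigInt =
  stats.getOrElse(throw new IllegalStateException(
    s"Table $table has no statistics; DetermineTableStats should have filled them in."))
```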


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17015: [SPARK-19678][SQL] remove MetastoreRelation

2017-02-27 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/17015#discussion_r103387529
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/command/AnalyzeColumnCommand.scala
 ---
@@ -90,10 +74,10 @@ object AnalyzeColumnCommand extends Logging {
*/
   def computeColumnStats(
--- End diff --

Now, this is not being used for testing. We can mark it as private. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #17079: [SPARK-19748][SQL]refresh function has a wrong order to ...

2017-02-27 Thread windpiger
Github user windpiger commented on the issue:

https://github.com/apache/spark/pull/17079
  
There is no related test case for InMemoryFileIndex with FileStatusCache.
While working on this [PR](https://github.com/apache/spark/pull/17081) and adding a 
fileStatusCache in DataSource, I found this bug.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #17012: [SPARK-19677][SS] Renaming a file atop an existing one s...

2017-02-27 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17012
  
**[Test build #73548 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73548/testReport)**
 for PR 17012 at commit 
[`530c027`](https://github.com/apache/spark/commit/530c027e8ac22caa6fac3770ae24c6727ab7c018).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #17047: [SPARK-19720][SPARK SUBMIT] Redact sensitive information...

2017-02-27 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17047
  
**[Test build #73554 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73554/testReport)**
 for PR 17047 at commit 
[`7753998`](https://github.com/apache/spark/commit/7753998f0a21073a05897b8945c8e61a1fe4fc84).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16819: [SPARK-16441][YARN] Set maxNumExecutor depends on yarn c...

2017-02-27 Thread wangyum
Github user wangyum commented on the issue:

https://github.com/apache/spark/pull/16819
  
@vanzin What do you think about the current approach? I have tested it on a Spark 
hive-thriftserver; `spark.dynamicAllocation.maxExecutors` will decrease if I kill 
4 NodeManagers:
```
17/02/27 15:58:08 DEBUG ExecutorAllocationManager: Not adding executors 
because our current target total is already 94 (limit 94)
17/02/27 15:58:09 DEBUG ExecutorAllocationManager: Not adding executors 
because our current target total is already 94 (limit 94)
17/02/27 16:05:49 DEBUG ExecutorAllocationManager: Not adding executors 
because our current target total is already 85 (limit 85)
17/02/27 16:05:49 DEBUG ExecutorAllocationManager: Not adding executors 
because our current target total is already 85 (limit 85)
```
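
A rough sketch of how such a cap could be derived from cluster metrics (the formula and names are assumptions for illustration, not this PR's code):

```
// Cap the executor count by whichever resource runs out first.
def maxExecutorsFromCluster(
    clusterMemoryMb: Long,
    clusterVCores: Int,
    executorMemoryMb: Long,
    executorCores: Int): Int = {
  val byMemory = (clusterMemoryMb / executorMemoryMb).toInt
  val byCores = clusterVCores / executorCores
  math.max(1, math.min(byMemory, byCores))
}
```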


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #17094: [SPARK-19762][ML][WIP] Hierarchy for consolidating ML ag...

2017-02-27 Thread sethah
Github user sethah commented on the issue:

https://github.com/apache/spark/pull/17094
  
ping @MLnick @jkbradley 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #17093: [SPARK-19761][SQL]create InMemoryFileIndex with an empty...

2017-02-27 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17093
  
Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #17093: [SPARK-19761][SQL]create InMemoryFileIndex with an empty...

2017-02-27 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17093
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/73552/
Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #17083: [SPARK-19750][UI][branch-2.1] Fix redirect issue from ht...

2017-02-27 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17083
  
**[Test build #73551 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73551/testReport)**
 for PR 17083 at commit 
[`9ee6a09`](https://github.com/apache/spark/commit/9ee6a096504d91809c1cec7b7b0b525d54646300).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #17052: [SPARK-19690][SS] Join a streaming DataFrame with a batc...

2017-02-27 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17052
  
**[Test build #73559 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73559/testReport)**
 for PR 17052 at commit 
[`59f4272`](https://github.com/apache/spark/commit/59f4272ee97b77bc1aaedb7daf63acf1b417d58e).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #17052: [SPARK-19690][SS] Join a streaming DataFrame with a batc...

2017-02-27 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17052
  
Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #17052: [SPARK-19690][SS] Join a streaming DataFrame with a batc...

2017-02-27 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17052
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/73559/
Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #17067: [SPARK-19602][SQL][TESTS] Add tests for qualified column...

2017-02-27 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17067
  
Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org


