[jira] [Updated] (SPARK-21358) Argument of repartitionandsortwithinpartitions at pyspark

2017-07-10 Thread chie hayashida (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21358?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

chie hayashida updated SPARK-21358:
---
Description: 
In rdd.py, the implementation of repartitionAndSortWithinPartitions is as follows.

{code}
    def repartitionAndSortWithinPartitions(self, numPartitions=None,
                                           partitionFunc=portable_hash,
                                           ascending=True, keyfunc=lambda x: x):
{code}

In the documentation, there is the following sample script.
{code}
>>> rdd = sc.parallelize([(0, 5), (3, 8), (2, 6), (0, 8), (3, 8), (1, 3)])
>>> rdd2 = rdd.repartitionAndSortWithinPartitions(2, lambda x: x % 2, 2)
{code}

The third argument (ascending) is expected to be a boolean, so the following script
would be better, I think.
{code}
>>> rdd = sc.parallelize([(0, 5), (3, 8), (2, 6), (0, 8), (3, 8), (1, 3)])
>>> rdd2 = rdd.repartitionAndSortWithinPartitions(2, lambda x: x % 2, True)
{code}
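
For reference, a minimal sketch of the corrected call together with the per-partition
contents it is expected to produce (the output shown is an assumption based on the
partition function splitting even and odd keys; ordering of records with equal keys may vary):
{code}
>>> rdd = sc.parallelize([(0, 5), (3, 8), (2, 6), (0, 8), (3, 8), (1, 3)])
>>> rdd2 = rdd.repartitionAndSortWithinPartitions(2, lambda x: x % 2, True)
>>> rdd2.glom().collect()
[[(0, 5), (0, 8), (2, 6)], [(1, 3), (3, 8), (3, 8)]]
{code}
Passing the arguments by keyword, e.g.
rdd.repartitionAndSortWithinPartitions(numPartitions=2, partitionFunc=lambda x: x % 2, ascending=True),
also avoids this kind of positional mix-up.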



  was:
In rdd.py, the implementation of repartitionAndSortWithinPartitions is as follows.

{code:python}
    def repartitionAndSortWithinPartitions(self, numPartitions=None,
                                           partitionFunc=portable_hash,
                                           ascending=True, keyfunc=lambda x: x):
{code}

In the documentation, there is the following sample script.
{code:python}
>>> rdd = sc.parallelize([(0, 5), (3, 8), (2, 6), (0, 8), (3, 8), (1, 3)])
>>> rdd2 = rdd.repartitionAndSortWithinPartitions(2, lambda x: x % 2, 2)
{code}

The third argument (ascending) is expected to be a boolean, so the following script
would be better, I think.
{code:python}
>>> rdd = sc.parallelize([(0, 5), (3, 8), (2, 6), (0, 8), (3, 8), (1, 3)])
>>> rdd2 = rdd.repartitionAndSortWithinPartitions(2, lambda x: x % 2, True)
{code}




> Argument of repartitionandsortwithinpartitions at pyspark
> -
>
> Key: SPARK-21358
> URL: https://issues.apache.org/jira/browse/SPARK-21358
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, Examples
>Affects Versions: 2.1.1
>Reporter: chie hayashida
>Priority: Minor
>
> In rdd.py, the implementation of repartitionAndSortWithinPartitions is as follows.
> {code}
>     def repartitionAndSortWithinPartitions(self, numPartitions=None,
>                                            partitionFunc=portable_hash,
>                                            ascending=True, keyfunc=lambda x: x):
> {code}
> In the documentation, there is the following sample script.
> {code}
> >>> rdd = sc.parallelize([(0, 5), (3, 8), (2, 6), (0, 8), (3, 8), (1, 3)])
> >>> rdd2 = rdd.repartitionAndSortWithinPartitions(2, lambda x: x % 2, 2)
> {code}
> The third argument (ascending) is expected to be a boolean, so the following
> script would be better, I think.
> {code}
> >>> rdd = sc.parallelize([(0, 5), (3, 8), (2, 6), (0, 8), (3, 8), (1, 3)])
> >>> rdd2 = rdd.repartitionAndSortWithinPartitions(2, lambda x: x % 2, True)
> {code}






[jira] [Updated] (SPARK-21358) Argument of repartitionandsortwithinpartitions at pyspark

2017-07-10 Thread chie hayashida (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21358?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

chie hayashida updated SPARK-21358:
---
Description: 
In rdd.py, the implementation of repartitionAndSortWithinPartitions is as follows.

{code:python}
    def repartitionAndSortWithinPartitions(self, numPartitions=None,
                                           partitionFunc=portable_hash,
                                           ascending=True, keyfunc=lambda x: x):
{code}

In the documentation, there is the following sample script.
{code:python}
>>> rdd = sc.parallelize([(0, 5), (3, 8), (2, 6), (0, 8), (3, 8), (1, 3)])
>>> rdd2 = rdd.repartitionAndSortWithinPartitions(2, lambda x: x % 2, 2)
{code}

The third argument (ascending) is expected to be a boolean, so the following script
would be better, I think.
{code:python}
>>> rdd = sc.parallelize([(0, 5), (3, 8), (2, 6), (0, 8), (3, 8), (1, 3)])
>>> rdd2 = rdd.repartitionAndSortWithinPartitions(2, lambda x: x % 2, True)
{code}



  was:
In rdd.py, the implementation of repartitionAndSortWithinPartitions is as follows.

```
    def repartitionAndSortWithinPartitions(self, numPartitions=None,
                                           partitionFunc=portable_hash,
                                           ascending=True, keyfunc=lambda x: x):
```
In the documentation, there is the following sample script.

```
>>> rdd = sc.parallelize([(0, 5), (3, 8), (2, 6), (0, 8), (3, 8), (1, 3)])
>>> rdd2 = rdd.repartitionAndSortWithinPartitions(2, lambda x: x % 2, 2)
```

The third argument (ascending) is expected to be a boolean, so the following script
would be better, I think.
```
>>> rdd = sc.parallelize([(0, 5), (3, 8), (2, 6), (0, 8), (3, 8), (1, 3)])
>>> rdd2 = rdd.repartitionAndSortWithinPartitions(2, lambda x: x % 2, True)
```



> Argument of repartitionandsortwithinpartitions at pyspark
> -
>
> Key: SPARK-21358
> URL: https://issues.apache.org/jira/browse/SPARK-21358
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, Examples
>Affects Versions: 2.1.1
>Reporter: chie hayashida
>Priority: Minor
>
> In rdd.py, the implementation of repartitionAndSortWithinPartitions is as follows.
> {code:python}
>     def repartitionAndSortWithinPartitions(self, numPartitions=None,
>                                            partitionFunc=portable_hash,
>                                            ascending=True, keyfunc=lambda x: x):
> {code}
> In the documentation, there is the following sample script.
> {code:python}
> >>> rdd = sc.parallelize([(0, 5), (3, 8), (2, 6), (0, 8), (3, 8), (1, 3)])
> >>> rdd2 = rdd.repartitionAndSortWithinPartitions(2, lambda x: x % 2, 2)
> {code}
> The third argument (ascending) is expected to be a boolean, so the following
> script would be better, I think.
> {code:python}
> >>> rdd = sc.parallelize([(0, 5), (3, 8), (2, 6), (0, 8), (3, 8), (1, 3)])
> >>> rdd2 = rdd.repartitionAndSortWithinPartitions(2, lambda x: x % 2, True)
> {code}






[jira] [Updated] (SPARK-21358) Argument of repartitionandsortwithinpartitions at pyspark

2017-07-09 Thread chie hayashida (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21358?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

chie hayashida updated SPARK-21358:
---
Summary: Argument of repartitionandsortwithinpartitions at pyspark  (was: 
variable of repartitionandsortwithinpartitions at pyspark)

> Argument of repartitionandsortwithinpartitions at pyspark
> -
>
> Key: SPARK-21358
> URL: https://issues.apache.org/jira/browse/SPARK-21358
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, Examples
>Affects Versions: 2.1.1
>Reporter: chie hayashida
>Priority: Minor
>
> In rdd.py, the implementation of repartitionAndSortWithinPartitions is as follows.
> ```
>     def repartitionAndSortWithinPartitions(self, numPartitions=None,
>                                            partitionFunc=portable_hash,
>                                            ascending=True, keyfunc=lambda x: x):
> ```
> In the documentation, there is the following sample script.
> ```
> >>> rdd = sc.parallelize([(0, 5), (3, 8), (2, 6), (0, 8), (3, 8), (1, 3)])
> >>> rdd2 = rdd.repartitionAndSortWithinPartitions(2, lambda x: x % 2, 2)
> ```
> The third argument (ascending) is expected to be a boolean, so the following
> script would be better, I think.
> ```
> >>> rdd = sc.parallelize([(0, 5), (3, 8), (2, 6), (0, 8), (3, 8), (1, 3)])
> >>> rdd2 = rdd.repartitionAndSortWithinPartitions(2, lambda x: x % 2, True)
> ```






[jira] [Created] (SPARK-21358) variable of repartitionandsortwithinpartitions at pyspark

2017-07-09 Thread chie hayashida (JIRA)
chie hayashida created SPARK-21358:
--

 Summary: variable of repartitionandsortwithinpartitions at pyspark
 Key: SPARK-21358
 URL: https://issues.apache.org/jira/browse/SPARK-21358
 Project: Spark
  Issue Type: Improvement
  Components: Documentation, Examples
Affects Versions: 2.1.1
Reporter: chie hayashida
Priority: Minor


In rdd.py, the implementation of repartitionAndSortWithinPartitions is as follows.

```
    def repartitionAndSortWithinPartitions(self, numPartitions=None,
                                           partitionFunc=portable_hash,
                                           ascending=True, keyfunc=lambda x: x):
```
In the documentation, there is the following sample script.

```
>>> rdd = sc.parallelize([(0, 5), (3, 8), (2, 6), (0, 8), (3, 8), (1, 3)])
>>> rdd2 = rdd.repartitionAndSortWithinPartitions(2, lambda x: x % 2, 2)
```

The third argument (ascending) is expected to be a boolean, so the following script
would be better, I think.
```
>>> rdd = sc.parallelize([(0, 5), (3, 8), (2, 6), (0, 8), (3, 8), (1, 3)])
>>> rdd2 = rdd.repartitionAndSortWithinPartitions(2, lambda x: x % 2, True)
```







[jira] [Comment Edited] (SPARK-17154) Wrong result can be returned or AnalysisException can be thrown after self-join or similar operations

2017-01-07 Thread chie hayashida (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15807077#comment-15807077
 ] 

chie hayashida edited comment on SPARK-17154 at 1/7/17 8:18 AM:


[~nsyca], [~cloud_fan], [~sarutak]
I have example code below.

h2. Example 1

{code}
scala> val df = 
Seq((1,1,1),(1,2,3),(1,4,5),(2,2,4),(2,5,7),(2,8,8)).toDF("id","value1","value2")
df: org.apache.spark.sql.DataFrame = [id: int, value1: int ... 1 more field]

scala> val df2 = df
df2: org.apache.spark.sql.DataFrame = [id: int, value1: int ... 1 more field]

scala> val df3 = df.join(df2,df("id") === df2("id") && df("value2") <= 
df2("value2"))
17/01/07 16:29:26 WARN Column: Constructing trivially true equals predicate, 
'id#171 = id#171'. Perhaps you need to use aliases.
df3: org.apache.spark.sql.DataFrame = [id: int, value1: int ... 4 more fields]

scala> df3.show
+---+------+------+---+------+------+
| id|value1|value2| id|value1|value2|
+---+------+------+---+------+------+
|  1|     1|     1|  1|     4|     5|
|  1|     1|     1|  1|     2|     3|
|  1|     1|     1|  1|     1|     1|
|  1|     2|     3|  1|     4|     5|
|  1|     2|     3|  1|     2|     3|
|  1|     2|     3|  1|     1|     1|
|  1|     4|     5|  1|     4|     5|
|  1|     4|     5|  1|     2|     3|
|  1|     4|     5|  1|     1|     1|
|  2|     2|     4|  2|     8|     8|
|  2|     2|     4|  2|     5|     7|
|  2|     2|     4|  2|     2|     4|
|  2|     5|     7|  2|     8|     8|
|  2|     5|     7|  2|     5|     7|
|  2|     5|     7|  2|     2|     4|
|  2|     8|     8|  2|     8|     8|
|  2|     8|     8|  2|     5|     7|
|  2|     8|     8|  2|     2|     4|
+---+------+------+---+------+------+


scala> df3.explain
== Physical Plan ==
*BroadcastHashJoin [id#171], [id#178], Inner, BuildRight
:- LocalTableScan [id#171, value1#172, value2#173]
+- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, false] 
as bigint)))
   +- LocalTableScan [id#178, value1#179, value2#180]
{code}

h2. Example2
{code}
scala> val df = 
Seq((1,1,1),(1,2,3),(1,4,5),(2,2,4),(2,5,7),(2,8,8)).toDF("id","value1","value2")
df: org.apache.spark.sql.DataFrame = [id: int, value1: int ... 1 more field]

scala> val df2 = 
df.select($"id".as("id2"),$"value1".as("value11"),$"value2".as("value22"))
df4: org.apache.spark.sql.DataFrame = [id2: int, value11: int ... 1 more field]

scala> val df3 = df.join(df2,df("id") === df2("id2") && df("value2") <= 
df2("value22"))
df5: org.apache.spark.sql.DataFrame = [id: int, value1: int ... 4 more fields]

scala> df3.show
+---+------+------+---+-------+-------+
| id|value1|value2|id2|value11|value22|
+---+------+------+---+-------+-------+
|  1|     1|     1|  1|      4|      5|
|  1|     1|     1|  1|      2|      3|
|  1|     1|     1|  1|      1|      1|
|  1|     2|     3|  1|      4|      5|
|  1|     2|     3|  1|      2|      3|
|  1|     4|     5|  1|      4|      5|
|  2|     2|     4|  2|      8|      8|
|  2|     2|     4|  2|      5|      7|
|  2|     2|     4|  2|      2|      4|
|  2|     5|     7|  2|      8|      8|
|  2|     5|     7|  2|      5|      7|
|  2|     8|     8|  2|      8|      8|
+---+------+------+---+-------+-------+

scala> df3.explain
== Physical Plan ==
*BroadcastHashJoin [id#171], [id2#243], Inner, BuildRight, (value2#173 <= 
value22#245)
:- LocalTableScan [id#171, value1#172, value2#173]
+- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, false] 
as bigint)))
   +- LocalTableScan [id2#243, value11#244, value22#245]
{code}

The contents of df3 are different between Example 1 and Example 2.

I think the reason for this is the same as SPARK-17154.

In the above case I understand the result of Example 1 is incorrect and that of
Example 2 is correct.
But this issue isn't trivial, and some developers may overlook this kind of buggy
code, I think.
Permanent action should be taken for this issue, I think.
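
As the warning in Example 1 suggests ("Perhaps you need to use aliases"), giving each
side of the self-join its own alias keeps the two value2 columns distinct without
renaming them by hand. A minimal sketch of that workaround, written in PySpark purely
for illustration (the expected 12-row result mirrors Example 2 and is an assumption,
not an output taken from this report):
{code}
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [(1, 1, 1), (1, 2, 3), (1, 4, 5), (2, 2, 4), (2, 5, 7), (2, 8, 8)],
    ["id", "value1", "value2"])

# Alias each side so column references stay unambiguous after the self-join.
left = df.alias("l")
right = df.alias("r")

df3 = left.join(
    right,
    (col("l.id") == col("r.id")) & (col("l.value2") <= col("r.value2")))

# Expected to keep the inequality predicate and return 12 rows, as in Example 2.
df3.show()
{code}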


was (Author: hayashidac):
[~nsyca], [~cloud_fan], [~sarutak]
I have an example code below.

h2. Example 1

{code}
scala> val df = 
Seq((1,1,1),(1,2,3),(1,4,5),(2,2,4),(2,5,7),(2,8,8)).toDF("id","value1","value2")
df: org.apache.spark.sql.DataFrame = [id: int, value1: int ... 1 more field]

scala> val df2 = df
df2: org.apache.spark.sql.DataFrame = [id: int, value1: int ... 1 more field]

scala> val df3 = df.join(df2,df("id") === df2("id") && df("value2") <= 
df2("value2"))
17/01/07 16:29:26 WARN Column: Constructing trivially true equals predicate, 
'id#171 = id#171'. Perhaps you need to use aliases.
df3: org.apache.spark.sql.DataFrame = [id: int, value1: int ... 4 more fields]

scala> df3.show
+---+--+--+---+--+--+
| id|value1|value2| id|value1|value2|
+---+--+--+---+--+--+
|  1| 1| 1|  1| 4| 5|
|  1| 1| 1|  1| 2| 3|
|  1| 1| 1|  1| 1| 1|
|  1| 2| 3|  1| 4| 5|
|  1| 2| 3|  1| 2|

[jira] [Comment Edited] (SPARK-17154) Wrong result can be returned or AnalysisException can be thrown after self-join or similar operations

2017-01-07 Thread chie hayashida (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15807077#comment-15807077
 ] 

chie hayashida edited comment on SPARK-17154 at 1/7/17 8:17 AM:


[~nsyca], [~cloud_fan], [~sarutak]
I have example code below.

h2. Example 1

{code}
scala> val df = 
Seq((1,1,1),(1,2,3),(1,4,5),(2,2,4),(2,5,7),(2,8,8)).toDF("id","value1","value2")
df: org.apache.spark.sql.DataFrame = [id: int, value1: int ... 1 more field]

scala> val df2 = df
df2: org.apache.spark.sql.DataFrame = [id: int, value1: int ... 1 more field]

scala> val df3 = df.join(df2,df("id") === df2("id") && df("value2") <= 
df2("value2"))
17/01/07 16:29:26 WARN Column: Constructing trivially true equals predicate, 
'id#171 = id#171'. Perhaps you need to use aliases.
df3: org.apache.spark.sql.DataFrame = [id: int, value1: int ... 4 more fields]

scala> df3.show
+---+--+--+---+--+--+
| id|value1|value2| id|value1|value2|
+---+--+--+---+--+--+
|  1| 1| 1|  1| 4| 5|
|  1| 1| 1|  1| 2| 3|
|  1| 1| 1|  1| 1| 1|
|  1| 2| 3|  1| 4| 5|
|  1| 2| 3|  1| 2| 3|
|  1| 2| 3|  1| 1| 1|
|  1| 4| 5|  1| 4| 5|
|  1| 4| 5|  1| 2| 3|
|  1| 4| 5|  1| 1| 1|
|  2| 2| 4|  2| 8| 8|
|  2| 2| 4|  2| 5| 7|
|  2| 2| 4|  2| 2| 4|
|  2| 5| 7|  2| 8| 8|
|  2| 5| 7|  2| 5| 7|
|  2| 5| 7|  2| 2| 4|
|  2| 8| 8|  2| 8| 8|
|  2| 8| 8|  2| 5| 7|
|  2| 8| 8|  2| 2| 4|
+---+--+--+---+--+--+


scala> df3.explain
== Physical Plan ==
*BroadcastHashJoin [id#171], [id#178], Inner, BuildRight
:- LocalTableScan [id#171, value1#172, value2#173]
+- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, false] 
as bigint)))
   +- LocalTableScan [id#178, value1#179, value2#180]
{code}

h2. Example2
{code}
scala> val df = 
Seq((1,1,1),(1,2,3),(1,4,5),(2,2,4),(2,5,7),(2,8,8)).toDF("id","value1","value2")
df: org.apache.spark.sql.DataFrame = [id: int, value1: int ... 1 more field]

scala> val df2 = 
df.select($"id".as("id2"),$"value1".as("value11"),$"value2".as("value22"))
df4: org.apache.spark.sql.DataFrame = [id2: int, value11: int ... 1 more field]

scala> val df3 = df.join(df2,df("id") === df2("id2") && df("value2") <= 
df2("value22"))
df5: org.apache.spark.sql.DataFrame = [id: int, value1: int ... 4 more fields]

scala> df3.show
+---+--+--+---+---+---+
| id|value1|value2|id2|value11|value22|
+---+--+--+---+---+---+
|  1| 1| 1|  1|  4|  5|
|  1| 1| 1|  1|  2|  3|
|  1| 1| 1|  1|  1|  1|
|  1| 2| 3|  1|  4|  5|
|  1| 2| 3|  1|  2|  3|
|  1| 4| 5|  1|  4|  5|
|  2| 2| 4|  2|  8|  8|
|  2| 2| 4|  2|  5|  7|
|  2| 2| 4|  2|  2|  4|
|  2| 5| 7|  2|  8|  8|
|  2| 5| 7|  2|  5|  7|
|  2| 8| 8|  2|  8|  8|
+---+--+--+---+---+---+

scala> df3.explain
== Physical Plan ==
*BroadcastHashJoin [id#171], [id2#243], Inner, BuildRight, (value2#173 <= 
value22#245)
:- LocalTableScan [id#171, value1#172, value2#173]
+- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, false] 
as bigint)))
   +- LocalTableScan [id2#243, value11#244, value22#245]
{code}

The contents of df3 are different between Example 1 and Example 2.

I think the reason for this is the same as SPARK-17154.

In the above case I understand the result of Example 1 is incorrect and that of
Example 2 is correct.
But this issue isn't trivial, and some developers may overlook this kind of buggy
code, I think.
Permanent action should be taken for this issue, I think.


was (Author: hayashidac):
[~nsyca], [~cloud_fan], [~sarutak]
I have an example code below.

h2. Example 1

{code}
scala> val df = 
Seq((1,1,1),(1,2,3),(1,4,5),(2,2,4),(2,5,7),(2,8,8)).toDF("id","value1","value2")
df: org.apache.spark.sql.DataFrame = [id: int, value1: int ... 1 more field]

scala> val df2 = df
df2: org.apache.spark.sql.DataFrame = [id: int, value1: int ... 1 more field]

scala> val df3 = df.join(df2,df("id") === df2("id") && df("value2") <= 
df2("value2"))
17/01/07 16:29:26 WARN Column: Constructing trivially true equals predicate, 
'id#171 = id#171'. Perhaps you need to use aliases.
df3: org.apache.spark.sql.DataFrame = [id: int, value1: int ... 4 more fields]

scala> df3.show
+---+--+--+---+--+--+
| id|value1|value2| id|value1|value2|
+---+--+--+---+--+--+
|  1| 1| 1|  1| 4| 5|
|  1| 1| 1|  1| 2| 3|
|  1| 1| 1|  1| 1| 1|
|  1| 2| 3|  1| 4| 5|
|  1| 2| 3|  1| 

[jira] [Comment Edited] (SPARK-17154) Wrong result can be returned or AnalysisException can be thrown after self-join or similar operations

2017-01-07 Thread chie hayashida (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15807077#comment-15807077
 ] 

chie hayashida edited comment on SPARK-17154 at 1/7/17 8:17 AM:


[~nsyca], [~cloud_fan], [~sarutak]
I have example code below.

h2. Example 1

{code}
scala> val df = 
Seq((1,1,1),(1,2,3),(1,4,5),(2,2,4),(2,5,7),(2,8,8)).toDF("id","value1","value2")
df: org.apache.spark.sql.DataFrame = [id: int, value1: int ... 1 more field]

scala> val df2 = df
df2: org.apache.spark.sql.DataFrame = [id: int, value1: int ... 1 more field]

scala> val df3 = df.join(df2,df("id") === df2("id") && df("value2") <= 
df2("value2"))
17/01/07 16:29:26 WARN Column: Constructing trivially true equals predicate, 
'id#171 = id#171'. Perhaps you need to use aliases.
df3: org.apache.spark.sql.DataFrame = [id: int, value1: int ... 4 more fields]

scala> df3.show
+---+--+--+---+--+--+
| id|value1|value2| id|value1|value2|
+---+--+--+---+--+--+
|  1| 1| 1|  1| 4| 5|
|  1| 1| 1|  1| 2| 3|
|  1| 1| 1|  1| 1| 1|
|  1| 2| 3|  1| 4| 5|
|  1| 2| 3|  1| 2| 3|
|  1| 2| 3|  1| 1| 1|
|  1| 4| 5|  1| 4| 5|
|  1| 4| 5|  1| 2| 3|
|  1| 4| 5|  1| 1| 1|
|  2| 2| 4|  2| 8| 8|
|  2| 2| 4|  2| 5| 7|
|  2| 2| 4|  2| 2| 4|
|  2| 5| 7|  2| 8| 8|
|  2| 5| 7|  2| 5| 7|
|  2| 5| 7|  2| 2| 4|
|  2| 8| 8|  2| 8| 8|
|  2| 8| 8|  2| 5| 7|
|  2| 8| 8|  2| 2| 4|
+---+--+--+---+--+--+


scala> df3.explain
== Physical Plan ==
*BroadcastHashJoin [id#171], [id#178], Inner, BuildRight
:- LocalTableScan [id#171, value1#172, value2#173]
+- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, false] 
as bigint)))
   +- LocalTableScan [id#178, value1#179, value2#180]
{code}

h2 Example2
{code}
scala> val df = 
Seq((1,1,1),(1,2,3),(1,4,5),(2,2,4),(2,5,7),(2,8,8)).toDF("id","value1","value2")
df: org.apache.spark.sql.DataFrame = [id: int, value1: int ... 1 more field]

scala> val df2 = 
df.select($"id".as("id2"),$"value1".as("value11"),$"value2".as("value22"))
df4: org.apache.spark.sql.DataFrame = [id2: int, value11: int ... 1 more field]

scala> val df3 = df.join(df2,df("id") === df2("id2") && df("value2") <= 
df2("value22"))
df5: org.apache.spark.sql.DataFrame = [id: int, value1: int ... 4 more fields]

scala> df3.show
+---+--+--+---+---+---+
| id|value1|value2|id2|value11|value22|
+---+--+--+---+---+---+
|  1| 1| 1|  1|  4|  5|
|  1| 1| 1|  1|  2|  3|
|  1| 1| 1|  1|  1|  1|
|  1| 2| 3|  1|  4|  5|
|  1| 2| 3|  1|  2|  3|
|  1| 4| 5|  1|  4|  5|
|  2| 2| 4|  2|  8|  8|
|  2| 2| 4|  2|  5|  7|
|  2| 2| 4|  2|  2|  4|
|  2| 5| 7|  2|  8|  8|
|  2| 5| 7|  2|  5|  7|
|  2| 8| 8|  2|  8|  8|
+---+--+--+---+---+---+

scala> df3.explain
== Physical Plan ==
*BroadcastHashJoin [id#171], [id2#243], Inner, BuildRight, (value2#173 <= 
value22#245)
:- LocalTableScan [id#171, value1#172, value2#173]
+- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, false] 
as bigint)))
   +- LocalTableScan [id2#243, value11#244, value22#245]
{code}

The contents of df3 are different between Example 1 and Example 2.

I think the reason for this is the same as SPARK-17154.

In the above case I understand the result of Example 1 is incorrect and that of
Example 2 is correct.
But this issue isn't trivial, and some developers may overlook this kind of buggy
code, I think.
Permanent action should be taken for this issue, I think.


was (Author: hayashidac):
[~nsyca], [~cloud_fan], [~sarutak]
I have an example code below.

# Example 1

{code}
scala> val df = 
Seq((1,1,1),(1,2,3),(1,4,5),(2,2,4),(2,5,7),(2,8,8)).toDF("id","value1","value2")
df: org.apache.spark.sql.DataFrame = [id: int, value1: int ... 1 more field]

scala> val df2 = df
df2: org.apache.spark.sql.DataFrame = [id: int, value1: int ... 1 more field]

scala> val df3 = df.join(df2,df("id") === df2("id") && df("value2") <= 
df2("value2"))
17/01/07 16:29:26 WARN Column: Constructing trivially true equals predicate, 
'id#171 = id#171'. Perhaps you need to use aliases.
df3: org.apache.spark.sql.DataFrame = [id: int, value1: int ... 4 more fields]

scala> df3.show
+---+--+--+---+--+--+
| id|value1|value2| id|value1|value2|
+---+--+--+---+--+--+
|  1| 1| 1|  1| 4| 5|
|  1| 1| 1|  1| 2| 3|
|  1| 1| 1|  1| 1| 1|
|  1| 2| 3|  1| 4| 5|
|  1| 2| 3|  1| 2|   

[jira] [Comment Edited] (SPARK-17154) Wrong result can be returned or AnalysisException can be thrown after self-join or similar operations

2017-01-07 Thread chie hayashida (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15807077#comment-15807077
 ] 

chie hayashida edited comment on SPARK-17154 at 1/7/17 8:14 AM:


[~nsyca], [~cloud_fan], [~sarutak]
I have example code below.

# Example 1

{code}
scala> val df = 
Seq((1,1,1),(1,2,3),(1,4,5),(2,2,4),(2,5,7),(2,8,8)).toDF("id","value1","value2")
df: org.apache.spark.sql.DataFrame = [id: int, value1: int ... 1 more field]

scala> val df2 = df
df2: org.apache.spark.sql.DataFrame = [id: int, value1: int ... 1 more field]

scala> val df3 = df.join(df2,df("id") === df2("id") && df("value2") <= 
df2("value2"))
17/01/07 16:29:26 WARN Column: Constructing trivially true equals predicate, 
'id#171 = id#171'. Perhaps you need to use aliases.
df3: org.apache.spark.sql.DataFrame = [id: int, value1: int ... 4 more fields]

scala> df3.show
+---+--+--+---+--+--+
| id|value1|value2| id|value1|value2|
+---+--+--+---+--+--+
|  1| 1| 1|  1| 4| 5|
|  1| 1| 1|  1| 2| 3|
|  1| 1| 1|  1| 1| 1|
|  1| 2| 3|  1| 4| 5|
|  1| 2| 3|  1| 2| 3|
|  1| 2| 3|  1| 1| 1|
|  1| 4| 5|  1| 4| 5|
|  1| 4| 5|  1| 2| 3|
|  1| 4| 5|  1| 1| 1|
|  2| 2| 4|  2| 8| 8|
|  2| 2| 4|  2| 5| 7|
|  2| 2| 4|  2| 2| 4|
|  2| 5| 7|  2| 8| 8|
|  2| 5| 7|  2| 5| 7|
|  2| 5| 7|  2| 2| 4|
|  2| 8| 8|  2| 8| 8|
|  2| 8| 8|  2| 5| 7|
|  2| 8| 8|  2| 2| 4|
+---+--+--+---+--+--+


scala> df3.explain
== Physical Plan ==
*BroadcastHashJoin [id#171], [id#178], Inner, BuildRight
:- LocalTableScan [id#171, value1#172, value2#173]
+- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, false] 
as bigint)))
   +- LocalTableScan [id#178, value1#179, value2#180]
{code}

# Example2
{code}
scala> val df = 
Seq((1,1,1),(1,2,3),(1,4,5),(2,2,4),(2,5,7),(2,8,8)).toDF("id","value1","value2")
df: org.apache.spark.sql.DataFrame = [id: int, value1: int ... 1 more field]

scala> val df2 = 
df.select($"id".as("id2"),$"value1".as("value11"),$"value2".as("value22"))
df4: org.apache.spark.sql.DataFrame = [id2: int, value11: int ... 1 more field]

scala> val df3 = df.join(df2,df("id") === df2("id2") && df("value2") <= 
df2("value22"))
df5: org.apache.spark.sql.DataFrame = [id: int, value1: int ... 4 more fields]

scala> df3.show
+---+--+--+---+---+---+
| id|value1|value2|id2|value11|value22|
+---+--+--+---+---+---+
|  1| 1| 1|  1|  4|  5|
|  1| 1| 1|  1|  2|  3|
|  1| 1| 1|  1|  1|  1|
|  1| 2| 3|  1|  4|  5|
|  1| 2| 3|  1|  2|  3|
|  1| 4| 5|  1|  4|  5|
|  2| 2| 4|  2|  8|  8|
|  2| 2| 4|  2|  5|  7|
|  2| 2| 4|  2|  2|  4|
|  2| 5| 7|  2|  8|  8|
|  2| 5| 7|  2|  5|  7|
|  2| 8| 8|  2|  8|  8|
+---+--+--+---+---+---+

scala> df3.explain
== Physical Plan ==
*BroadcastHashJoin [id#171], [id2#243], Inner, BuildRight, (value2#173 <= 
value22#245)
:- LocalTableScan [id#171, value1#172, value2#173]
+- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, false] 
as bigint)))
   +- LocalTableScan [id2#243, value11#244, value22#245]
{code}

The contents of df3 are different between Example 1 and Example 2.

I think the reason for this is the same as SPARK-17154.

In the above case I understand the result of Example 1 is incorrect and that of
Example 2 is correct.
But this issue isn't trivial, and some developers may overlook this kind of buggy
code, I think.
Permanent action should be taken for this issue, I think.


was (Author: hayashidac):
[~nsyca], [~cloud_fan], [~sarutak]
I have an example code below.

# Example 1

``` scala
scala> val df = 
Seq((1,1,1),(1,2,3),(1,4,5),(2,2,4),(2,5,7),(2,8,8)).toDF("id","value1","value2")
df: org.apache.spark.sql.DataFrame = [id: int, value1: int ... 1 more field]

scala> val df2 = df
df2: org.apache.spark.sql.DataFrame = [id: int, value1: int ... 1 more field]

scala> val df3 = df.join(df2,df("id") === df2("id") && df("value2") <= 
df2("value2"))
17/01/07 16:29:26 WARN Column: Constructing trivially true equals predicate, 
'id#171 = id#171'. Perhaps you need to use aliases.
df3: org.apache.spark.sql.DataFrame = [id: int, value1: int ... 4 more fields]

scala> df3.show
+---+--+--+---+--+--+
| id|value1|value2| id|value1|value2|
+---+--+--+---+--+--+
|  1| 1| 1|  1| 4| 5|
|  1| 1| 1|  1| 2| 3|
|  1| 1| 1|  1| 1| 1|
|  1| 2| 3|  1| 4| 5|
|  1| 2| 3|  1| 2|   

[jira] [Commented] (SPARK-17154) Wrong result can be returned or AnalysisException can be thrown after self-join or similar operations

2017-01-07 Thread chie hayashida (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15807077#comment-15807077
 ] 

chie hayashida commented on SPARK-17154:


[~nsyca], [~cloud_fan], [~sarutak]
I have example code below.

# Example 1

``` scala
scala> val df = 
Seq((1,1,1),(1,2,3),(1,4,5),(2,2,4),(2,5,7),(2,8,8)).toDF("id","value1","value2")
df: org.apache.spark.sql.DataFrame = [id: int, value1: int ... 1 more field]

scala> val df2 = df
df2: org.apache.spark.sql.DataFrame = [id: int, value1: int ... 1 more field]

scala> val df3 = df.join(df2,df("id") === df2("id") && df("value2") <= 
df2("value2"))
17/01/07 16:29:26 WARN Column: Constructing trivially true equals predicate, 
'id#171 = id#171'. Perhaps you need to use aliases.
df3: org.apache.spark.sql.DataFrame = [id: int, value1: int ... 4 more fields]

scala> df3.show
+---+------+------+---+------+------+
| id|value1|value2| id|value1|value2|
+---+------+------+---+------+------+
|  1|     1|     1|  1|     4|     5|
|  1|     1|     1|  1|     2|     3|
|  1|     1|     1|  1|     1|     1|
|  1|     2|     3|  1|     4|     5|
|  1|     2|     3|  1|     2|     3|
|  1|     2|     3|  1|     1|     1|
|  1|     4|     5|  1|     4|     5|
|  1|     4|     5|  1|     2|     3|
|  1|     4|     5|  1|     1|     1|
|  2|     2|     4|  2|     8|     8|
|  2|     2|     4|  2|     5|     7|
|  2|     2|     4|  2|     2|     4|
|  2|     5|     7|  2|     8|     8|
|  2|     5|     7|  2|     5|     7|
|  2|     5|     7|  2|     2|     4|
|  2|     8|     8|  2|     8|     8|
|  2|     8|     8|  2|     5|     7|
|  2|     8|     8|  2|     2|     4|
+---+------+------+---+------+------+


scala> df3.explain
== Physical Plan ==
*BroadcastHashJoin [id#171], [id#178], Inner, BuildRight
:- LocalTableScan [id#171, value1#172, value2#173]
+- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, false] 
as bigint)))
   +- LocalTableScan [id#178, value1#179, value2#180]
```

# Example2
```scala
scala> val df = 
Seq((1,1,1),(1,2,3),(1,4,5),(2,2,4),(2,5,7),(2,8,8)).toDF("id","value1","value2")
df: org.apache.spark.sql.DataFrame = [id: int, value1: int ... 1 more field]

scala> val df2 = 
df.select($"id".as("id2"),$"value1".as("value11"),$"value2".as("value22"))
df4: org.apache.spark.sql.DataFrame = [id2: int, value11: int ... 1 more field]

scala> val df3 = df.join(df2,df("id") === df2("id2") && df("value2") <= 
df2("value22"))
df5: org.apache.spark.sql.DataFrame = [id: int, value1: int ... 4 more fields]

scala> df3.show
+---+------+------+---+-------+-------+
| id|value1|value2|id2|value11|value22|
+---+------+------+---+-------+-------+
|  1|     1|     1|  1|      4|      5|
|  1|     1|     1|  1|      2|      3|
|  1|     1|     1|  1|      1|      1|
|  1|     2|     3|  1|      4|      5|
|  1|     2|     3|  1|      2|      3|
|  1|     4|     5|  1|      4|      5|
|  2|     2|     4|  2|      8|      8|
|  2|     2|     4|  2|      5|      7|
|  2|     2|     4|  2|      2|      4|
|  2|     5|     7|  2|      8|      8|
|  2|     5|     7|  2|      5|      7|
|  2|     8|     8|  2|      8|      8|
+---+------+------+---+-------+-------+

scala> df3.explain
== Physical Plan ==
*BroadcastHashJoin [id#171], [id2#243], Inner, BuildRight, (value2#173 <= 
value22#245)
:- LocalTableScan [id#171, value1#172, value2#173]
+- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, false] 
as bigint)))
   +- LocalTableScan [id2#243, value11#244, value22#245]

```

The contents of df3 are different between Example 1 and Example 2.

I think the reason for this is the same as SPARK-17154.

In the above case I understand the result of Example 1 is incorrect and that of
Example 2 is correct.
But this issue isn't trivial, and some developers may overlook this kind of buggy
code, I think.
Permanent action should be taken for this issue, I think.

> Wrong result can be returned or AnalysisException can be thrown after 
> self-join or similar operations
> -
>
> Key: SPARK-17154
> URL: https://issues.apache.org/jira/browse/SPARK-17154
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.2, 2.0.0
>Reporter: Kousuke Saruta
> Attachments: Name-conflicts-2.pdf, Solution_Proposal_SPARK-17154.pdf
>
>
> When we join two DataFrames which originate from the same DataFrame,
> operations on the joined DataFrame can fail.
> One reproducible example is as follows.
> {code}
> val df = Seq(
>   (1, "a", "A"),
>   (2, "b", "B"),
>   (3, "c", "C"),
>   (4, "d", "D"),
>   (5, "e", "E")).toDF("col1", "col2", "col3")
>   val filtered = df.filter("col1 != 3").select("col1", "col2")
>   val joined = filtered.join(df, filtered("col1") === df("col1"), "inner")
>   val selected1 = 

[jira] [Closed] (SPARK-18384) explanation of maxMemoryInMB in treeParams at should be written more in API doc

2016-11-09 Thread chie hayashida (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18384?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

chie hayashida closed SPARK-18384.
--
Resolution: Invalid

It has already been fixed

> explanation of maxMemoryInMB in treeParams at should be written more in API 
> doc 
> 
>
> Key: SPARK-18384
> URL: https://issues.apache.org/jira/browse/SPARK-18384
> Project: Spark
>  Issue Type: Documentation
>Reporter: chie hayashida
>Priority: Minor
>
> The explanation of maxMemoryInMB in treeParams is too simple in the Scala API doc.
> We should mention more about this parameter's effect.






[jira] [Created] (SPARK-18384) explanation of maxMemoryInMB in treeParams at should be written more in API doc

2016-11-09 Thread chie hayashida (JIRA)
chie hayashida created SPARK-18384:
--

 Summary: explanation of maxMemoryInMB in treeParams at should be 
written more in API doc 
 Key: SPARK-18384
 URL: https://issues.apache.org/jira/browse/SPARK-18384
 Project: Spark
  Issue Type: Documentation
Reporter: chie hayashida
Priority: Minor


The explanation of maxMemoryInMB in treeParams is too simple in the Scala API doc.
We should mention more about this parameter's effect.






[jira] [Comment Edited] (SPARK-17154) Wrong result can be returned or AnalysisException can be thrown after self-join or similar operations

2016-11-09 Thread chie hayashida (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15650402#comment-15650402
 ] 

chie hayashida edited comment on SPARK-17154 at 11/9/16 9:08 AM:
-

I faced this problem. How is the progress?


was (Author: hayashidac):
I'm facing this problem. How is the progress?

> Wrong result can be returned or AnalysisException can be thrown after 
> self-join or similar operations
> -
>
> Key: SPARK-17154
> URL: https://issues.apache.org/jira/browse/SPARK-17154
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.2, 2.0.0
>Reporter: Kousuke Saruta
> Attachments: Name-conflicts-2.pdf, Solution_Proposal_SPARK-17154.pdf
>
>
> When we join two DataFrames which originate from the same DataFrame,
> operations on the joined DataFrame can fail.
> One reproducible example is as follows.
> {code}
> val df = Seq(
>   (1, "a", "A"),
>   (2, "b", "B"),
>   (3, "c", "C"),
>   (4, "d", "D"),
>   (5, "e", "E")).toDF("col1", "col2", "col3")
>   val filtered = df.filter("col1 != 3").select("col1", "col2")
>   val joined = filtered.join(df, filtered("col1") === df("col1"), "inner")
>   val selected1 = joined.select(df("col3"))
> {code}
> In this case, AnalysisException is thrown.
> Another example is as follows.
> {code}
> val df = Seq(
>   (1, "a", "A"),
>   (2, "b", "B"),
>   (3, "c", "C"),
>   (4, "d", "D"),
>   (5, "e", "E")).toDF("col1", "col2", "col3")
>   val filtered = df.filter("col1 != 3").select("col1", "col2")
>   val rightOuterJoined = filtered.join(df, filtered("col1") === df("col1"), 
> "right")
>   val selected2 = rightOuterJoined.select(df("col1"))
>   selected2.show
> {code}
> In this case, we expect to get the answer as follows.
> {code}
> 1
> 2
> 3
> 4
> 5
> {code}
> But the actual result is as follows.
> {code}
> 1
> 2
> null
> 4
> 5
> {code}
> The cause of the problems in the examples is that the logical plan related to
> the right-side DataFrame and the expressions of its output are re-created in
> the analyzer (in the ResolveReferences rule) when a DataFrame has expressions
> which share the same exprId with each other.
> Re-created expressions are equal to the original ones except for the exprId.
> This will happen when we do a self-join or similar pattern of operations.
> In the first example, df("col3") returns a Column which includes an
> expression, and the expression has an exprId (say id1 here).
> After the join, the expression which the right-side DataFrame (df) has is
> re-created, and the old and new expressions are equal but the exprId is renewed
> (say id2 for the new exprId here).
> Because of the mismatch of those exprIds, an AnalysisException is thrown.
> In the second example, df("col1") returns a column and the expression
> contained in the column is assigned an exprId (say id3).
> On the other hand, a column returned by filtered("col1") has an expression
> which has the same exprId (id3).
> After the join, the expressions in the right-side DataFrame are re-created, and
> the expression assigned id3 is no longer present in the right side but is
> present in the left side.
> So, referring to df("col1") on the joined DataFrame, we get col1 of the right
> side, which includes null.






[jira] [Commented] (SPARK-17154) Wrong result can be returned or AnalysisException can be thrown after self-join or similar operations

2016-11-09 Thread chie hayashida (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15650402#comment-15650402
 ] 

chie hayashida commented on SPARK-17154:


I'm facing this problem. How is the progress?

> Wrong result can be returned or AnalysisException can be thrown after 
> self-join or similar operations
> -
>
> Key: SPARK-17154
> URL: https://issues.apache.org/jira/browse/SPARK-17154
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.2, 2.0.0
>Reporter: Kousuke Saruta
> Attachments: Name-conflicts-2.pdf, Solution_Proposal_SPARK-17154.pdf
>
>
> When we join two DataFrames which originate from the same DataFrame,
> operations on the joined DataFrame can fail.
> One reproducible example is as follows.
> {code}
> val df = Seq(
>   (1, "a", "A"),
>   (2, "b", "B"),
>   (3, "c", "C"),
>   (4, "d", "D"),
>   (5, "e", "E")).toDF("col1", "col2", "col3")
>   val filtered = df.filter("col1 != 3").select("col1", "col2")
>   val joined = filtered.join(df, filtered("col1") === df("col1"), "inner")
>   val selected1 = joined.select(df("col3"))
> {code}
> In this case, AnalysisException is thrown.
> Another example is as follows.
> {code}
> val df = Seq(
>   (1, "a", "A"),
>   (2, "b", "B"),
>   (3, "c", "C"),
>   (4, "d", "D"),
>   (5, "e", "E")).toDF("col1", "col2", "col3")
>   val filtered = df.filter("col1 != 3").select("col1", "col2")
>   val rightOuterJoined = filtered.join(df, filtered("col1") === df("col1"), 
> "right")
>   val selected2 = rightOuterJoined.select(df("col1"))
>   selected2.show
> {code}
> In this case, we expect to get the answer as follows.
> {code}
> 1
> 2
> 3
> 4
> 5
> {code}
> But the actual result is as follows.
> {code}
> 1
> 2
> null
> 4
> 5
> {code}
> The cause of the problems in the examples is that the logical plan related to
> the right-side DataFrame and the expressions of its output are re-created in
> the analyzer (in the ResolveReferences rule) when a DataFrame has expressions
> which share the same exprId with each other.
> Re-created expressions are equal to the original ones except for the exprId.
> This will happen when we do a self-join or similar pattern of operations.
> In the first example, df("col3") returns a Column which includes an
> expression, and the expression has an exprId (say id1 here).
> After the join, the expression which the right-side DataFrame (df) has is
> re-created, and the old and new expressions are equal but the exprId is renewed
> (say id2 for the new exprId here).
> Because of the mismatch of those exprIds, an AnalysisException is thrown.
> In the second example, df("col1") returns a column and the expression
> contained in the column is assigned an exprId (say id3).
> On the other hand, a column returned by filtered("col1") has an expression
> which has the same exprId (id3).
> After the join, the expressions in the right-side DataFrame are re-created, and
> the expression assigned id3 is no longer present in the right side but is
> present in the left side.
> So, referring to df("col1") on the joined DataFrame, we get col1 of the right
> side, which includes null.






[jira] [Commented] (SPARK-13770) Document the ML feature Interaction

2016-10-27 Thread chie hayashida (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15611044#comment-15611044
 ] 

chie hayashida commented on SPARK-13770:


I added examples and documentation. Please check it.

> Document the ML feature Interaction
> ---
>
> Key: SPARK-13770
> URL: https://issues.apache.org/jira/browse/SPARK-13770
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, ML
>Affects Versions: 1.6.0
>Reporter: Abbass Marouni
>Priority: Minor
>
> The ML feature Interaction 
> (http://spark.apache.org/docs/latest/api/java/org/apache/spark/ml/feature/Interaction.html)
>  is not included in the documentation of ML features. It'd be nice to provide 
> a working example and some documentation.






[jira] [Commented] (SPARK-16987) Add spark-default.conf property to define https port for spark history server

2016-10-25 Thread chie hayashida (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15605225#comment-15605225
 ] 

chie hayashida commented on SPARK-16987:


Can I work on this issue?

> Add spark-default.conf property to define https port for spark history server
> -
>
> Key: SPARK-16987
> URL: https://issues.apache.org/jira/browse/SPARK-16987
> Project: Spark
>  Issue Type: Improvement
>Affects Versions: 2.0.0
>Reporter: Yesha Vora
>Priority: Minor
>
> With SPARK-2750, the Spark History Server UI becomes accessible on an https port.
> Currently, the https port is pre-defined as the http port + 400.
> The Spark History Server UI https port should not be pre-defined; it should be
> configurable.
> Thus, Spark should introduce a new property to make the Spark History Server
> https port configurable.






[jira] [Commented] (SPARK-16988) spark history server log needs to be fixed to show https url when ssl is enabled

2016-10-24 Thread chie hayashida (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16988?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15602484#comment-15602484
 ] 

chie hayashida commented on SPARK-16988:


Can I work on this issue?

> spark history server log needs to be fixed to show https url when ssl is 
> enabled
> 
>
> Key: SPARK-16988
> URL: https://issues.apache.org/jira/browse/SPARK-16988
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
>Affects Versions: 2.0.0
>Reporter: Yesha Vora
>Priority: Minor
>
> When spark ssl is enabled, the spark history server ui (http://host:port) is
> redirected to https://host:port+400.
> So, the spark history server log should be updated to print the https url
> instead of the http url.
> {code:title=spark HS log}
> 16/08/09 15:21:11 INFO ServerConnector: Started 
> ServerConnector@3970a5ee{SSL-HTTP/1.1}{0.0.0.0:18481}
> 16/08/09 15:21:11 INFO Server: Started @4023ms
> 16/08/09 15:21:11 INFO Utils: Successfully started service on port 18081.
> 16/08/09 15:21:11 INFO HistoryServer: Bound HistoryServer to 0.0.0.0, and 
> started at http://xxx:18081
> 16/08/09 15:22:52 INFO FsHistoryProvider: Replaying log path: 
> hdfs://xxx:8020/yy/application_1470756121646_0001.inprogress{code}






[jira] [Issue Comment Deleted] (SPARK-16988) spark history server log needs to be fixed to show https url when ssl is enabled

2016-10-24 Thread chie hayashida (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16988?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

chie hayashida updated SPARK-16988:
---
Comment: was deleted

(was: Can I work on it?)

> spark history server log needs to be fixed to show https url when ssl is 
> enabled
> 
>
> Key: SPARK-16988
> URL: https://issues.apache.org/jira/browse/SPARK-16988
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
>Affects Versions: 2.0.0
>Reporter: Yesha Vora
>Priority: Minor
>
> When spark ssl is enabled, the spark history server ui (http://host:port) is
> redirected to https://host:port+400.
> So, the spark history server log should be updated to print the https url
> instead of the http url.
> {code:title=spark HS log}
> 16/08/09 15:21:11 INFO ServerConnector: Started 
> ServerConnector@3970a5ee{SSL-HTTP/1.1}{0.0.0.0:18481}
> 16/08/09 15:21:11 INFO Server: Started @4023ms
> 16/08/09 15:21:11 INFO Utils: Successfully started service on port 18081.
> 16/08/09 15:21:11 INFO HistoryServer: Bound HistoryServer to 0.0.0.0, and 
> started at http://xxx:18081
> 16/08/09 15:22:52 INFO FsHistoryProvider: Replaying log path: 
> hdfs://xxx:8020/yy/application_1470756121646_0001.inprogress{code}






[jira] [Commented] (SPARK-16988) spark history server log needs to be fixed to show https url when ssl is enabled

2016-10-24 Thread chie hayashida (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16988?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15602442#comment-15602442
 ] 

chie hayashida commented on SPARK-16988:


Can I work on it?

> spark history server log needs to be fixed to show https url when ssl is 
> enabled
> 
>
> Key: SPARK-16988
> URL: https://issues.apache.org/jira/browse/SPARK-16988
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
>Affects Versions: 2.0.0
>Reporter: Yesha Vora
>Priority: Minor
>
> When spark ssl is enabled, the spark history server ui (http://host:port) is
> redirected to https://host:port+400.
> So, the spark history server log should be updated to print the https url
> instead of the http url.
> {code:title=spark HS log}
> 16/08/09 15:21:11 INFO ServerConnector: Started 
> ServerConnector@3970a5ee{SSL-HTTP/1.1}{0.0.0.0:18481}
> 16/08/09 15:21:11 INFO Server: Started @4023ms
> 16/08/09 15:21:11 INFO Utils: Successfully started service on port 18081.
> 16/08/09 15:21:11 INFO HistoryServer: Bound HistoryServer to 0.0.0.0, and 
> started at http://xxx:18081
> 16/08/09 15:22:52 INFO FsHistoryProvider: Replaying log path: 
> hdfs://xxx:8020/yy/application_1470756121646_0001.inprogress{code}


