[jira] [Commented] (SPARK-20408) Get glob path in parallel to reduce resolve relation time

2018-06-22 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-20408?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16520976#comment-16520976
 ] 

Apache Spark commented on SPARK-20408:
--

User 'xuanyuanking' has created a pull request for this issue:
https://github.com/apache/spark/pull/21618

> Get glob path in parallel to reduce resolve relation time
> -
>
> Key: SPARK-20408
> URL: https://issues.apache.org/jira/browse/SPARK-20408
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Li Yuanjian
>Priority: Major
>
> Reading a datasource from a wildcard path like the one below causes a long 
> wait (each star may represent 100~1000 files or directories), and the problem 
> is amplified in a cross-region environment where the driver and HDFS live in 
> different regions.
> bq. spark.read.text("/log/product/201704/\*/\*/\*/\*")
> The optimization strategy is the same as bulkListLeafFiles in 
> InMemoryFileIndex: resolve the wildcard paths in parallel.
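For illustration, a minimal sketch (not the actual patch in the linked PR) of resolving several glob patterns in parallel with Hadoop's FileSystem.globStatus; the helper name and the use of Scala parallel collections are assumptions:
{code:scala}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path

// Minimal sketch, not the actual patch: resolve each glob pattern on its own
// thread instead of sequentially, similar in spirit to
// InMemoryFileIndex.bulkListLeafFiles.
def globPathsInParallel(patterns: Seq[String], hadoopConf: Configuration): Seq[Path] = {
  patterns.par.flatMap { pattern =>
    val path = new Path(pattern)
    val fs = path.getFileSystem(hadoopConf)
    // globStatus returns null when nothing matches the pattern
    Option(fs.globStatus(path)).toSeq.flatten.map(_.getPath)
  }.seq
}
{code}
The real change presumably lands in Spark's datasource path-resolution code; this only shows the shape of the parallel glob.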






[jira] [Commented] (SPARK-23710) Upgrade Hive to 2.3.2

2018-06-22 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16520967#comment-16520967
 ] 

Hyukjin Kwon commented on SPARK-23710:
--

Yup, I agree that should be done first and that this ticket should target the 
next major version.
I was trying to check what's affected here and whether it's possible to make a 
safer fix so that we can probably target 3.0.0.

> Upgrade Hive to 2.3.2
> -
>
> Key: SPARK-23710
> URL: https://issues.apache.org/jira/browse/SPARK-23710
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Yuming Wang
>Priority: Critical
>
> h1. Main changes
>  * Maven dependencies:
>  hive.version from {{1.2.1.spark2}} to {{2.3.2}}, and change 
> {{hive.classifier}} to {{core}}
>  calcite.version from {{1.2.0-incubating}} to {{1.10.0}}
>  datanucleus-core.version from {{3.2.10}} to {{4.1.17}}
>  remove {{orc.classifier}}, which means ORC uses {{hive.storage.api}}; see 
> ORC-174
>  add new dependencies {{avatica}} and {{hive.storage.api}}
>  * ORC compatibility changes:
>  OrcColumnVector.java, OrcColumnarBatchReader.java, OrcDeserializer.scala, 
> OrcFilters.scala, OrcSerializer.scala, OrcFilterSuite.scala
>  * hive-thriftserver Java file updates:
>  update {{sql/hive-thriftserver/if/TCLIService.thrift}} to Hive 2.3.2
>  update {{sql/hive-thriftserver/src/main/java/org/apache/hive/service/*}} to 
> Hive 2.3.2
>  * Test suites to update:
> ||TestSuite||Reason||
> |StatisticsSuite|HIVE-16098|
> |SessionCatalogSuite|Similar to [VersionsSuite.scala#L427|#L427]|
> |CliSuite, HiveThriftServer2Suites, HiveSparkSubmitSuite, HiveQuerySuite, 
> SQLQuerySuite|Update hive-hcatalog-core-0.13.1.jar to 
> hive-hcatalog-core-2.3.2.jar|
> |SparkExecuteStatementOperationSuite|Interface changed from 
> org.apache.hive.service.cli.Type.NULL_TYPE to 
> org.apache.hadoop.hive.serde2.thrift.Type.NULL_TYPE|
> |ClasspathDependenciesSuite|org.apache.hive.com.esotericsoftware.kryo.Kryo 
> change to com.esotericsoftware.kryo.Kryo|
> |HiveMetastoreCatalogSuite|Result format changed from Seq("1.1\t1", "2.1\t2") 
> to Seq("1.100\t1", "2.100\t2")|
> |HiveOrcFilterSuite|Result format changed|
> |HiveDDLSuite|Remove $ (This change needs to be reconsidered)|
> |HiveExternalCatalogVersionsSuite| java.lang.ClassCastException: 
> org.datanucleus.identity.DatastoreIdImpl cannot be cast to 
> org.datanucleus.identity.OID|
>  * Other changes:
> Disable Hive schema verification:  
> [HiveClientImpl.scala#L251|https://github.com/wangyum/spark/blob/75e4cc9e80f85517889e87a35da117bc361f2ff3/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala#L251]
>  and 
> [HiveExternalCatalog.scala#L58|https://github.com/wangyum/spark/blob/75e4cc9e80f85517889e87a35da117bc361f2ff3/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala#L58]
> Update 
> [IsolatedClientLoader.scala#L189-L192|https://github.com/wangyum/spark/blob/75e4cc9e80f85517889e87a35da117bc361f2ff3/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/IsolatedClientLoader.scala#L189-L192]
> Because Hive 2.3.2's {{org.apache.hadoop.hive.ql.metadata.Hive}} can't 
> connect to a Hive 1.x metastore, we should use 
> {{HiveMetaStoreClient.getDelegationToken}} instead of 
> {{Hive.getDelegationToken}} and update {{HiveClientImpl.toHiveTable}}.
> All changes can be found at 
> [PR-20659|https://github.com/apache/spark/pull/20659].






[jira] [Assigned] (SPARK-24634) Add a new metric regarding number of rows later than watermark

2018-06-22 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24634?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24634:


Assignee: Apache Spark

> Add a new metric regarding number of rows later than watermark
> --
>
> Key: SPARK-24634
> URL: https://issues.apache.org/jira/browse/SPARK-24634
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: Jungtaek Lim
>Assignee: Apache Spark
>Priority: Major
>
> Spark filters out rows that are later than the watermark while applying 
> operations that leverage windows. While Spark exposes watermark information 
> to StreamingQueryListener, there is no information about rows being filtered 
> out because of the watermark. That information would help end users adjust 
> the watermark while operating their query.
> We could expose a metric for the number of rows that are later than the 
> watermark and get filtered out. It would be ideal to support a side-output 
> for consuming late rows, but that does not look easy, so we are addressing 
> this first.






[jira] [Commented] (SPARK-24634) Add a new metric regarding number of rows later than watermark

2018-06-22 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16520950#comment-16520950
 ] 

Apache Spark commented on SPARK-24634:
--

User 'HeartSaVioR' has created a pull request for this issue:
https://github.com/apache/spark/pull/21617

> Add a new metric regarding number of rows later than watermark
> --
>
> Key: SPARK-24634
> URL: https://issues.apache.org/jira/browse/SPARK-24634
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: Jungtaek Lim
>Priority: Major
>
> Spark filters out rows that are later than the watermark while applying 
> operations that leverage windows. While Spark exposes watermark information 
> to StreamingQueryListener, there is no information about rows being filtered 
> out because of the watermark. That information would help end users adjust 
> the watermark while operating their query.
> We could expose a metric for the number of rows that are later than the 
> watermark and get filtered out. It would be ideal to support a side-output 
> for consuming late rows, but that does not look easy, so we are addressing 
> this first.






[jira] [Assigned] (SPARK-24634) Add a new metric regarding number of rows later than watermark

2018-06-22 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24634?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24634:


Assignee: (was: Apache Spark)

> Add a new metric regarding number of rows later than watermark
> --
>
> Key: SPARK-24634
> URL: https://issues.apache.org/jira/browse/SPARK-24634
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: Jungtaek Lim
>Priority: Major
>
> Spark filters out rows that are later than the watermark while applying 
> operations that leverage windows. While Spark exposes watermark information 
> to StreamingQueryListener, there is no information about rows being filtered 
> out because of the watermark. That information would help end users adjust 
> the watermark while operating their query.
> We could expose a metric for the number of rows that are later than the 
> watermark and get filtered out. It would be ideal to support a side-output 
> for consuming late rows, but that does not look easy, so we are addressing 
> this first.






[jira] [Commented] (SPARK-24634) Add a new metric regarding number of rows later than watermark

2018-06-22 Thread Jungtaek Lim (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16520948#comment-16520948
 ] 

Jungtaek Lim commented on SPARK-24634:
--

Working on this. Will submit a patch soon.

> Add a new metric regarding number of rows later than watermark
> --
>
> Key: SPARK-24634
> URL: https://issues.apache.org/jira/browse/SPARK-24634
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: Jungtaek Lim
>Priority: Major
>
> Spark filters out rows that are later than the watermark while applying 
> operations that leverage windows. While Spark exposes watermark information 
> to StreamingQueryListener, there is no information about rows being filtered 
> out because of the watermark. That information would help end users adjust 
> the watermark while operating their query.
> We could expose a metric for the number of rows that are later than the 
> watermark and get filtered out. It would be ideal to support a side-output 
> for consuming late rows, but that does not look easy, so we are addressing 
> this first.






[jira] [Created] (SPARK-24634) Add a new metric regarding number of rows later than watermark

2018-06-22 Thread Jungtaek Lim (JIRA)
Jungtaek Lim created SPARK-24634:


 Summary: Add a new metric regarding number of rows later than 
watermark
 Key: SPARK-24634
 URL: https://issues.apache.org/jira/browse/SPARK-24634
 Project: Spark
  Issue Type: Improvement
  Components: Structured Streaming
Affects Versions: 2.4.0
Reporter: Jungtaek Lim


Spark filters out rows that are later than the watermark while applying 
operations that leverage windows. While Spark exposes watermark information 
to StreamingQueryListener, there is no information about rows being filtered 
out because of the watermark. That information would help end users adjust 
the watermark while operating their query.

We could expose a metric for the number of rows that are later than the 
watermark and get filtered out. It would be ideal to support a side-output 
for consuming late rows, but that does not look easy, so we are addressing 
this first.
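To make the use case concrete, a hypothetical consumer-side sketch: once such a metric is reported through StreamingQueryProgress, an end user could watch it from a StreamingQueryListener and tune the watermark. The metric itself does not exist yet, so the sketch only prints the progress JSON where it would appear:
{code:scala}
import org.apache.spark.sql.streaming.StreamingQueryListener
import org.apache.spark.sql.streaming.StreamingQueryListener._

// Hypothetical monitoring sketch: the late-row counter proposed here is not an
// existing field, so we just surface the full progress JSON where it would show up.
class WatermarkTuningListener extends StreamingQueryListener {
  override def onQueryStarted(event: QueryStartedEvent): Unit = ()
  override def onQueryTerminated(event: QueryTerminatedEvent): Unit = ()
  override def onQueryProgress(event: QueryProgressEvent): Unit = {
    val p = event.progress
    println(s"batch=${p.batchId} watermark=${p.eventTime.get("watermark")}")
    println(p.json)  // a "rows later than watermark" metric would appear in here
  }
}

// Register with: spark.streams.addListener(new WatermarkTuningListener)
{code}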






[jira] [Created] (SPARK-24633) arrays_zip function's code generator splits input processing incorrectly

2018-06-22 Thread Bruce Robbins (JIRA)
Bruce Robbins created SPARK-24633:
-

 Summary: arrays_zip function's code generator splits input 
processing incorrectly
 Key: SPARK-24633
 URL: https://issues.apache.org/jira/browse/SPARK-24633
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.4.0
 Environment: Mac OS High Sierra
Reporter: Bruce Robbins


This works:
{noformat}
scala> val df = spark.read.parquet("many_arrays_per_row")
df: org.apache.spark.sql.DataFrame = [k0: array, k1: array ... 
98 more fields]
scala> df.selectExpr("arrays_zip(k0, k1, k2)").show(truncate=false)
++
|arrays_zip(k0, k1, k2)  |
++
|[[6583, 1312, 7460], [668, 1626, 4129]] |
|[[5415, 5251, 1514], [1631, 2224, 2553]]|
++
{noformat}
If I add one more array to the parameter list, I get this:
{noformat}
scala> df.selectExpr("arrays_zip(k0, k1, k2, k3)").show(truncate=false)
18/06/22 18:06:41 ERROR CodeGenerator: failed to compile: 
org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 92, 
Column 35: Unknown variable or type "scan_row_0"
org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 92, 
Column 35: Unknown variable or type "scan_row_0"
at 
org.codehaus.janino.UnitCompiler.compileError(UnitCompiler.java:11821)
at org.codehaus.janino.UnitCompiler.getType2(UnitCompiler.java:6521)
at org.codehaus.janino.UnitCompiler.access$13100(UnitCompiler.java:212)
at 
org.codehaus.janino.UnitCompiler$18.visitPackage(UnitCompiler.java:6133)
.. much exception trace...
18/06/22 18:06:41 WARN WholeStageCodegenExec: Whole-stage codegen disabled for 
plan (id=1):
 *(1) LocalLimit 21
+- *(1) Project [cast(arrays_zip(k0#375, k1#376, k2#387, k3#398) as string) AS 
arrays_zip(k0, k1, k2, k3)#619]
   +- *(1) FileScan parquet [k0#375,k1#376,k2#387,k3#398] Batched: false, 
Format: Parquet, Location: 
InMemoryFileIndex[file:/Users/brobbins/github/spark_upstream/many_arrays_per_row],
 PartitionFilters: [], PushedFilters: [], ReadSchema: 
struct,k1:array,k2:array,k3:array>

++
|arrays_zip(k0, k1, k2, k3)  |
++
|[[6583, 1312, 7460, 3712], [668, 1626, 4129, 2815]] |
|[[5415, 5251, 1514, 1580], [1631, 2224, 2553, 7555]]|
++
{noformat}
I still got the answer!

I add a 5th parameter:
{noformat}
scala> df.selectExpr("arrays_zip(k0, k1, k2, k3, k4)").show(truncate=false)
18/06/22 18:07:53 ERROR CodeGenerator: failed to compile: 
org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 97, 
Column 35: Unknown variable or type "scan_row_0"
org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 97, 
Column 35: Unknown variable or type "scan_row_0"
at 
org.codehaus.janino.UnitCompiler.compileError(UnitCompiler.java:11821)
at org.codehaus.janino.UnitCompiler.getType2(UnitCompiler.java:6521)
at org.codehaus.janino.UnitCompiler.access$13100(UnitCompiler.java:212)
at 
org.codehaus.janino.UnitCompiler$18.visitPackage(UnitCompiler.java:6133)
at 
org.codehaus.janino.UnitCompiler$18.visitPackage(UnitCompiler.java:6130)
at org.codehaus.janino.Java$Package.accept(Java.java:4077)
.. much exception trace...
Caused by: org.codehaus.commons.compiler.CompileException: File 
'generated.java', Line 73, Column 21: failed to compile: 
org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 73, 
Column 21: Unknown variable or type "i"
  at 
org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.org$apache$spark$sql$catalyst$expressions$codegen$CodeGenerator$$doCompile(CodeGenerator.scal\
a:1361)
  at 
org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:1423)
  at 
org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:1420)
  at 
com.google.common.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3599)
  at com.google.common.cache.LocalCache$Segment.loadSync(LocalCache.java:2379)
  ... 31 more
scala> 
{noformat}
This time, no result.

It looks like the generated code expects the input row to be available as a 
parameter (either i or scan_row_x), but that parameter is not passed to the 
input handler function (see lines 069 and 073):
{noformat}
/* 069 */   private int getValuesAndCardinalities_0_1(ArrayData[] arrVals_0, 
int biggestCardinality_0) {
/* 070 */
/* 071 */
/* 072 */ if (biggestCardinality_0 != -1) {
/* 073 */   boolean isNull_6 = i.isNullAt(4);
/* 074 */   ArrayData value_6 = isNull_6 ?
/* 075 */   null : (i.getArray(4));
/* 076 */   if (!isNull_6) {
/* 077

[jira] [Commented] (SPARK-24631) Cannot up cast column from bigint to smallint as it may truncate

2018-06-22 Thread Sivakumar (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24631?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16520879#comment-16520879
 ] 

Sivakumar commented on SPARK-24631:
---

It's resolved. Actually I was trying to query a view. Recreated the view and it 
worked.

> Cannot up cast column from bigint to smallint as it may truncate
> 
>
> Key: SPARK-24631
> URL: https://issues.apache.org/jira/browse/SPARK-24631
> Project: Spark
>  Issue Type: New JIRA Project
>  Components: Spark Core, Spark Submit
>Affects Versions: 2.2.1
>Reporter: Sivakumar
>Priority: Major
>
> Getting the below error when executing a simple select query.
> Sample:
> Table description:
> name: String, id: BigInt
> val df = spark.sql("select name, id from testtable")
> ERROR: {color:#ff}Cannot up cast column "id" from bigint to smallint as 
> it may truncate.{color}
> I am not doing any transformations, I am just trying to query a table, but 
> I still get the error.
> I get this error only on the production cluster and only for a single 
> table; other tables run fine.
> + More details:
> val df = spark.sql("select * from table_name")
> I am just trying to query a table, and with other tables it runs fine.
> {color:#d04437}18/06/22 01:36:29 ERROR Driver1: [] [main] Exception occurred: 
> org.apache.spark.sql.AnalysisException: Cannot up cast `column_name` from 
> bigint to column_name#2525: smallint as it may truncate.{color}
> That specific column has the Bigint datatype, but there were other tables 
> with Bigint columns that ran fine.
>  






[jira] [Commented] (SPARK-23704) PySpark access of individual trees in random forest is slow

2018-06-22 Thread Seth Hendrickson (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16520832#comment-16520832
 ] 

Seth Hendrickson commented on SPARK-23704:
--

Instead of
{code:java}
model.trees[0].transform(test_feat).select('rowNum','probability'){code}
Can you try
{code:java}
trees = model.trees
trees[0].transform(test_feat).select('rowNum','probability'){code}
And time only the second line? The first line actually calls into the JVM and 
creates new trees in Python.

> PySpark access of individual trees in random forest is slow
> ---
>
> Key: SPARK-23704
> URL: https://issues.apache.org/jira/browse/SPARK-23704
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Affects Versions: 2.2.1
> Environment: PySpark 2.2.1 / Windows 10
>Reporter: Julian King
>Priority: Minor
>
> Making predictions from a randomForestClassifier in PySpark is much faster 
> than making predictions from an individual tree contained in the .trees 
> attribute.
> In fact, the model.transform call without an action is more than 10x slower 
> for an individual tree than the model.transform call for the random forest 
> model.
> See 
> [https://stackoverflow.com/questions/49297470/slow-individual-tree-access-for-random-forest-in-pyspark]
>  for an example with timings.
> Ideally:
>  * Getting a prediction from a single tree should be comparable to or faster 
> than getting predictions from the whole forest
>  * Getting all the predictions from all the individual trees should be 
> comparable in speed to getting the predictions from the random forest
>  






[jira] [Resolved] (SPARK-24532) HiveExternalCatalogVersionSuite should be resilient to missing versions

2018-06-22 Thread Marcelo Vanzin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24532?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-24532.

Resolution: Won't Fix

I've added documentation to the release docs about this test, so RMs know it 
has to be checked.

I played with code for this (which I'll leave at 
https://github.com/vanzin/spark/tree/SPARK-24532), but it seems like too much 
code, and a little too brittle for my taste, to avoid a little work when a new 
release goes out.

> HiveExternalCatalogVersionSuite should be resilient to missing versions
> ---
>
> Key: SPARK-24532
> URL: https://issues.apache.org/jira/browse/SPARK-24532
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Tests
>Affects Versions: 2.3.1, 2.4.0
>Reporter: Marcelo Vanzin
>Priority: Major
>
> See SPARK-24531.
> As part of our release process we clean up older releases from the mirror 
> network. That causes this test to start failing.
> The test should be more resilient to this; either ignore releases that are 
> not found, or fallback to the ASF archive.
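For reference, an illustrative sketch (not the code on the linked branch) of the "fall back to the ASF archive" option: probe the mirror URL first, then archive.apache.org, and skip the version if neither responds:
{code:scala}
import java.net.{HttpURLConnection, URL}
import scala.util.Try

// Illustrative only: return the first download URL that answers a HEAD request,
// or None so the suite can skip that Spark version entirely.
def resolveDownloadUrl(version: String): Option[String] = {
  val file = s"spark-$version/spark-$version-bin-hadoop2.7.tgz"
  val candidates = Seq(
    s"https://www.apache.org/dyn/closer.lua?action=download&filename=spark/$file",
    s"https://archive.apache.org/dist/spark/$file"
  )
  candidates.find { url =>
    Try {
      val conn = new URL(url).openConnection().asInstanceOf[HttpURLConnection]
      conn.setRequestMethod("HEAD")
      conn.setConnectTimeout(5000)
      try conn.getResponseCode == HttpURLConnection.HTTP_OK finally conn.disconnect()
    }.getOrElse(false)
  }
}
{code}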






[jira] [Commented] (SPARK-24631) Cannot up cast column from bigint to smallint as it may truncate

2018-06-22 Thread vaquar khan (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24631?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16520785#comment-16520785
 ] 

vaquar khan commented on SPARK-24631:
-

You just need to cast the column; the issue will be resolved.
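A sketch of that cast (assumes a SparkSession named spark; whether to cast to smallint or bigint depends on what the consuming schema expects):
{code:scala}
// Add the cast explicitly so the analyzer is not asked to narrow bigint to
// smallint on its own.
val df = spark.sql("SELECT name, CAST(id AS SMALLINT) AS id FROM testtable")

// DataFrame API equivalent:
import org.apache.spark.sql.functions.col
val df2 = spark.table("testtable").select(col("name"), col("id").cast("smallint").as("id"))
{code}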

> Cannot up cast column from bigint to smallint as it may truncate
> 
>
> Key: SPARK-24631
> URL: https://issues.apache.org/jira/browse/SPARK-24631
> Project: Spark
>  Issue Type: New JIRA Project
>  Components: Spark Core, Spark Submit
>Affects Versions: 2.2.1
>Reporter: Sivakumar
>Priority: Major
>
> Getting the below error when executing a simple select query.
> Sample:
> Table description:
> name: String, id: BigInt
> val df = spark.sql("select name, id from testtable")
> ERROR: {color:#ff}Cannot up cast column "id" from bigint to smallint as 
> it may truncate.{color}
> I am not doing any transformations, I am just trying to query a table, but 
> I still get the error.
> I get this error only on the production cluster and only for a single 
> table; other tables run fine.
> + More details:
> val df = spark.sql("select * from table_name")
> I am just trying to query a table, and with other tables it runs fine.
> {color:#d04437}18/06/22 01:36:29 ERROR Driver1: [] [main] Exception occurred: 
> org.apache.spark.sql.AnalysisException: Cannot up cast `column_name` from 
> bigint to column_name#2525: smallint as it may truncate.{color}
> That specific column has the Bigint datatype, but there were other tables 
> with Bigint columns that ran fine.
>  






[jira] [Updated] (SPARK-22897) Expose stageAttemptId in TaskContext

2018-06-22 Thread Marcelo Vanzin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-22897?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin updated SPARK-22897:
---
Fix Version/s: 2.1.3

> Expose  stageAttemptId in TaskContext
> -
>
> Key: SPARK-22897
> URL: https://issues.apache.org/jira/browse/SPARK-22897
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.2, 2.2.1
>Reporter: Xianjin YE
>Assignee: Xianjin YE
>Priority: Minor
> Fix For: 2.1.3, 2.2.2, 2.3.0
>
>
> Currently, there's no easy way for an Executor to detect that a new stage has 
> launched, as stageAttemptId is missing.
> I'd like to propose exposing stageAttemptId in TaskContext, and will send a 
> PR if the community thinks it's a good idea.
> cc [~cloud_fan]
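A sketch of how executor-side code could use the proposed field once exposed; the accessor name stageAttemptNumber() follows the linked fix but should be treated as an assumption here:
{code:scala}
import org.apache.spark.{SparkConf, SparkContext, TaskContext}

val sc = SparkContext.getOrCreate(new SparkConf().setAppName("stage-attempt-demo").setMaster("local[2]"))
sc.parallelize(1 to 10, 2).foreachPartition { _ =>
  val tc = TaskContext.get()
  // A new (stageId, stageAttemptNumber) pair tells the executor that the stage
  // was re-launched (e.g. after a fetch failure), so per-stage state can be reset.
  println(s"stage=${tc.stageId()} stageAttempt=${tc.stageAttemptNumber()} " +
    s"task=${tc.partitionId()}.${tc.attemptNumber()}")
}
{code}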






[jira] [Updated] (SPARK-24589) OutputCommitCoordinator may allow duplicate commits

2018-06-22 Thread Marcelo Vanzin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24589?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin updated SPARK-24589:
---
Fix Version/s: 2.1.3

> OutputCommitCoordinator may allow duplicate commits
> ---
>
> Key: SPARK-24589
> URL: https://issues.apache.org/jira/browse/SPARK-24589
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.1, 2.3.1
>Reporter: Marcelo Vanzin
>Assignee: Marcelo Vanzin
>Priority: Blocker
> Fix For: 2.1.3, 2.2.2, 2.3.2, 2.4.0
>
>
> This is a sibling bug to SPARK-24552. While investigating the source of that 
> bug, it was found that the output committer currently allows duplicate 
> commits when there are stage retries and tasks with the same task attempt 
> number (one in each stage attempt that currently has running tasks) try to 
> commit their output.
> This can lead to duplicate data in the output.






[jira] [Commented] (SPARK-24552) Task attempt numbers are reused when stages are retried

2018-06-22 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24552?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16520752#comment-16520752
 ] 

Apache Spark commented on SPARK-24552:
--

User 'vanzin' has created a pull request for this issue:
https://github.com/apache/spark/pull/21616

> Task attempt numbers are reused when stages are retried
> ---
>
> Key: SPARK-24552
> URL: https://issues.apache.org/jira/browse/SPARK-24552
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.1, 2.2.0, 2.2.1, 2.3.0, 2.3.1
>Reporter: Ryan Blue
>Priority: Blocker
>
> When stages are retried due to shuffle failures, task attempt numbers are 
> reused. This causes a correctness bug in the v2 data sources write path.
> Data sources (both the original and v2) pass the task attempt to writers so 
> that writers can use the attempt number to track and clean up data from 
> failed or speculative attempts. In the v2 docs for DataWriterFactory, the 
> attempt number's javadoc states that "Implementations can use this attempt 
> number to distinguish writers of different task attempts."
> When two attempts of a stage use the same (partition, attempt) pair, two 
> tasks can create the same data and attempt to commit. The commit coordinator 
> prevents both from committing and will abort the attempt that finishes last. 
> When using the (partition, attempt) pair to track data, the aborted task may 
> delete data associated with the (partition, attempt) pair. If that happens, 
> the data for the task that committed is deleted as well, which is a 
> correctness bug.
> For a concrete example, I have a data source that creates files in place 
> named with {{part---.}}. Because these 
> files are written in place, both tasks create the same file and the one that 
> is aborted deletes the file, leading to data corruption when the file is 
> added to the table.
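An illustrative sketch (not the actual fix) of how an in-place writer could avoid the collision: key the file name on the globally unique taskAttemptId() rather than the per-stage attemptNumber():
{code:scala}
import org.apache.spark.TaskContext

// Illustrative only: attemptNumber() restarts from 0 on a stage retry, while
// taskAttemptId() never repeats within an application, so it is safe to embed
// in an in-place file name. Must be called from inside a running task.
def outputFileName(prefix: String, extension: String): String = {
  val tc = TaskContext.get()
  s"$prefix-${tc.partitionId()}-${tc.taskAttemptId()}.$extension"
}
{code}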






[jira] [Commented] (SPARK-24552) Task attempt numbers are reused when stages are retried

2018-06-22 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24552?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16520742#comment-16520742
 ] 

Apache Spark commented on SPARK-24552:
--

User 'vanzin' has created a pull request for this issue:
https://github.com/apache/spark/pull/21615

> Task attempt numbers are reused when stages are retried
> ---
>
> Key: SPARK-24552
> URL: https://issues.apache.org/jira/browse/SPARK-24552
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.1, 2.2.0, 2.2.1, 2.3.0, 2.3.1
>Reporter: Ryan Blue
>Priority: Blocker
>
> When stages are retried due to shuffle failures, task attempt numbers are 
> reused. This causes a correctness bug in the v2 data sources write path.
> Data sources (both the original and v2) pass the task attempt to writers so 
> that writers can use the attempt number to track and clean up data from 
> failed or speculative attempts. In the v2 docs for DataWriterFactory, the 
> attempt number's javadoc states that "Implementations can use this attempt 
> number to distinguish writers of different task attempts."
> When two attempts of a stage use the same (partition, attempt) pair, two 
> tasks can create the same data and attempt to commit. The commit coordinator 
> prevents both from committing and will abort the attempt that finishes last. 
> When using the (partition, attempt) pair to track data, the aborted task may 
> delete data associated with the (partition, attempt) pair. If that happens, 
> the data for the task that committed is deleted as well, which is a 
> correctness bug.
> For a concrete example, I have a data source that creates files in place 
> named with {{part---.}}. Because these 
> files are written in place, both tasks create the same file and the one that 
> is aborted deletes the file, leading to data corruption when the file is 
> added to the table.






[jira] [Commented] (SPARK-14922) Alter Table Drop Partition Using Predicate-based Partition Spec

2018-06-22 Thread nirav patel (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-14922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16520715#comment-16520715
 ] 

nirav patel commented on SPARK-14922:
-

Hi, any updates on this? Is there any workaround in the meantime? Is it 
possible to append an empty dataframe to those partitions?

> Alter Table Drop Partition Using Predicate-based Partition Spec
> ---
>
> Key: SPARK-14922
> URL: https://issues.apache.org/jira/browse/SPARK-14922
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.1.2, 2.2.1
>Reporter: Xiao Li
>Priority: Major
>
> Below is allowed in Hive, but not allowed in Spark.
> {noformat}
> alter table ptestfilter drop partition (c='US', d<'2')
> {noformat}
> This example is copied from drop_partitions_filter.q
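Regarding the workaround question above: until predicates are supported, one untested sketch is to enumerate partitions with SHOW PARTITIONS, filter them on the driver, and drop each match with an equality spec (assumes a SparkSession named spark and the example table):
{code:scala}
// Untested workaround sketch: emulate "drop partition (c='US', d<'2')" by listing
// partitions, filtering on the driver, and dropping matches one by one.
val parts = spark.sql("SHOW PARTITIONS ptestfilter").collect().map(_.getString(0))
parts
  .map(_.split("/").map(_.split("=", 2)).collect { case Array(k, v) => k -> v }.toMap)
  .filter(p => p.get("c").contains("US") && p.get("d").exists(_ < "2"))
  .foreach { p =>
    val spec = p.map { case (k, v) => s"$k='$v'" }.mkString(", ")
    spark.sql(s"ALTER TABLE ptestfilter DROP IF EXISTS PARTITION ($spec)")
  }
{code}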






[jira] [Assigned] (SPARK-24632) Allow 3rd-party libraries to use pyspark.ml abstractions for Java wrappers for persistence

2018-06-22 Thread Joseph K. Bradley (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24632?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley reassigned SPARK-24632:
-

Assignee: Joseph K. Bradley

> Allow 3rd-party libraries to use pyspark.ml abstractions for Java wrappers 
> for persistence
> --
>
> Key: SPARK-24632
> URL: https://issues.apache.org/jira/browse/SPARK-24632
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Affects Versions: 2.4.0
>Reporter: Joseph K. Bradley
>Assignee: Joseph K. Bradley
>Priority: Major
>
> This is a follow-up for [SPARK-17025], which allowed users to implement 
> Python PipelineStages in 3rd-party libraries, include them in Pipelines, and 
> use Pipeline persistence.  This task is to make it easier for 3rd-party 
> libraries to have PipelineStages written in Java and then to use pyspark.ml 
> abstractions to create wrappers around those Java classes.  This is currently 
> possible, except that users hit bugs around persistence.
> Some fixes we'll need include:
> * an overridable method for converting between Python and Java classpaths. 
> See 
> https://github.com/apache/spark/blob/b56e9c613fb345472da3db1a567ee129621f6bf3/python/pyspark/ml/util.py#L284
> * 
> https://github.com/apache/spark/blob/4e7d8678a3d9b12797d07f5497e0ed9e471428dd/python/pyspark/ml/pipeline.py#L378
> One unusual thing for this task will be to write unit tests which test a 
> custom PipelineStage written outside of the pyspark package.






[jira] [Updated] (SPARK-24632) Allow 3rd-party libraries to use pyspark.ml abstractions for Java wrappers for persistence

2018-06-22 Thread Joseph K. Bradley (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24632?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-24632:
--
Description: 
This is a follow-up for [SPARK-17025], which allowed users to implement Python 
PipelineStages in 3rd-party libraries, include them in Pipelines, and use 
Pipeline persistence.  This task is to make it easier for 3rd-party libraries 
to have PipelineStages written in Java and then to use pyspark.ml abstractions 
to create wrappers around those Java classes.  This is currently possible, 
except that users hit bugs around persistence.

Some fixes we'll need include:
* an overridable method for converting between Python and Java classpaths. See 
https://github.com/apache/spark/blob/b56e9c613fb345472da3db1a567ee129621f6bf3/python/pyspark/ml/util.py#L284
* 
https://github.com/apache/spark/blob/4e7d8678a3d9b12797d07f5497e0ed9e471428dd/python/pyspark/ml/pipeline.py#L378

One unusual thing for this task will be to write unit tests which test a custom 
PipelineStage written outside of the pyspark package.

  was:
This is a follow-up for [SPARK-17025], which allowed users to implement Python 
PipelineStages in 3rd-party libraries, include them in Pipelines, and use 
Pipeline persistence.  This task is to make it easier for 3rd-party libraries 
to have PipelineStages written in Java and then to use pyspark.ml abstractions 
to create wrappers around those Java classes.  This is currently possible, 
except that users hit bugs around persistence.

One fix we'll need is an overridable method for converting between Python and 
Java classpaths. See 
https://github.com/apache/spark/blob/b56e9c613fb345472da3db1a567ee129621f6bf3/python/pyspark/ml/util.py#L284

One unusual thing for this task will be to write unit tests which test a custom 
PipelineStage written outside of the pyspark package.


> Allow 3rd-party libraries to use pyspark.ml abstractions for Java wrappers 
> for persistence
> --
>
> Key: SPARK-24632
> URL: https://issues.apache.org/jira/browse/SPARK-24632
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Affects Versions: 2.4.0
>Reporter: Joseph K. Bradley
>Priority: Major
>
> This is a follow-up for [SPARK-17025], which allowed users to implement 
> Python PipelineStages in 3rd-party libraries, include them in Pipelines, and 
> use Pipeline persistence.  This task is to make it easier for 3rd-party 
> libraries to have PipelineStages written in Java and then to use pyspark.ml 
> abstractions to create wrappers around those Java classes.  This is currently 
> possible, except that users hit bugs around persistence.
> Some fixes we'll need include:
> * an overridable method for converting between Python and Java classpaths. 
> See 
> https://github.com/apache/spark/blob/b56e9c613fb345472da3db1a567ee129621f6bf3/python/pyspark/ml/util.py#L284
> * 
> https://github.com/apache/spark/blob/4e7d8678a3d9b12797d07f5497e0ed9e471428dd/python/pyspark/ml/pipeline.py#L378
> One unusual thing for this task will be to write unit tests which test a 
> custom PipelineStage written outside of the pyspark package.






[jira] [Resolved] (SPARK-21926) Compatibility between ML Transformers and Structured Streaming

2018-06-22 Thread Joseph K. Bradley (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-21926?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley resolved SPARK-21926.
---
   Resolution: Fixed
Fix Version/s: 2.3.0

Marking fix version as 2.3.0 since the issues were fixed in 2.3, even though 
tests were not completed in time for 2.3.

> Compatibility between ML Transformers and Structured Streaming
> --
>
> Key: SPARK-21926
> URL: https://issues.apache.org/jira/browse/SPARK-21926
> Project: Spark
>  Issue Type: Umbrella
>  Components: ML, Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Bago Amirbekian
>Priority: Major
> Fix For: 2.3.0
>
>
> We've run into a few cases where ML components don't play nice with streaming 
> dataframes (for prediction). This ticket is meant to help aggregate these 
> known cases in one place and provide a place to discuss possible fixes.
> Failing cases:
> 1) VectorAssembler where one of the inputs is a VectorUDT column with no 
> metadata.
> Possible fixes:
> More details here SPARK-22346.
> 2) OneHotEncoder where the input is a column with no metadata.
> Possible fixes:
> a) Make OneHotEncoder an estimator (SPARK-13030).
> -b) Allow user to set the cardinality of OneHotEncoder.-






[jira] [Updated] (SPARK-24465) LSHModel should support Structured Streaming for transform

2018-06-22 Thread Joseph K. Bradley (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-24465:
--
Description: 
Locality Sensitive Hashing (LSH) Models (BucketedRandomProjectionLSHModel, 
MinHashLSHModel) are not compatible with Structured Streaming.

This task is to add unit tests for streaming (as in [SPARK-22644]) for 
LSHModels after [SPARK-12878] has been fixed.

  was:
Locality Sensitive Hashing (LSH) Models (BucketedRandomProjectionLSHModel, 
MinHashLSHModel) are not compatible with Structured Streaming (and I believe 
are the final Transformers which are not compatible).  These do not work 
because Spark SQL does not support nested types containing UDTs, but LSH .

This task is to add unit tests for streaming (as in [SPARK-22644]) for 
LSHModels after [SPARK-12878] has been fixed.


> LSHModel should support Structured Streaming for transform
> --
>
> Key: SPARK-24465
> URL: https://issues.apache.org/jira/browse/SPARK-24465
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Affects Versions: 2.4.0
>Reporter: Joseph K. Bradley
>Assignee: Joseph K. Bradley
>Priority: Major
> Fix For: 2.3.1, 2.4.0
>
>
> Locality Sensitive Hashing (LSH) Models (BucketedRandomProjectionLSHModel, 
> MinHashLSHModel) are not compatible with Structured Streaming.
> This task is to add unit tests for streaming (as in [SPARK-22644]) for 
> LSHModels after [SPARK-12878] has been fixed.






[jira] [Comment Edited] (SPARK-24465) LSHModel should support Structured Streaming for transform

2018-06-22 Thread Joseph K. Bradley (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16520677#comment-16520677
 ] 

Joseph K. Bradley edited comment on SPARK-24465 at 6/22/18 6:39 PM:


Oh actually I think I made this by mistake or forgot to update it?  I fixed 
this in [SPARK-22883]


was (Author: josephkb):
Oh actually I think I made this by mistake?  I fixed this in [SPARK-22883]

> LSHModel should support Structured Streaming for transform
> --
>
> Key: SPARK-24465
> URL: https://issues.apache.org/jira/browse/SPARK-24465
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Affects Versions: 2.4.0
>Reporter: Joseph K. Bradley
>Priority: Major
>
> Locality Sensitive Hashing (LSH) Models (BucketedRandomProjectionLSHModel, 
> MinHashLSHModel) are not compatible with Structured Streaming (and I believe 
> are the final Transformers which are not compatible).  These do not work 
> because Spark SQL does not support nested types containing UDTs, but LSH .
> This task is to add unit tests for streaming (as in [SPARK-22644]) for 
> LSHModels after [SPARK-12878] has been fixed.






[jira] [Commented] (SPARK-24465) LSHModel should support Structured Streaming for transform

2018-06-22 Thread Joseph K. Bradley (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16520677#comment-16520677
 ] 

Joseph K. Bradley commented on SPARK-24465:
---

Oh actually I think I made this by mistake?  I fixed this in [SPARK-22883]

> LSHModel should support Structured Streaming for transform
> --
>
> Key: SPARK-24465
> URL: https://issues.apache.org/jira/browse/SPARK-24465
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Affects Versions: 2.4.0
>Reporter: Joseph K. Bradley
>Priority: Major
>
> Locality Sensitive Hashing (LSH) Models (BucketedRandomProjectionLSHModel, 
> MinHashLSHModel) are not compatible with Structured Streaming (and I believe 
> are the final Transformers which are not compatible).  These do not work 
> because Spark SQL does not support nested types containing UDTs, but LSH .
> This task is to add unit tests for streaming (as in [SPARK-22644]) for 
> LSHModels after [SPARK-12878] has been fixed.






[jira] [Resolved] (SPARK-24465) LSHModel should support Structured Streaming for transform

2018-06-22 Thread Joseph K. Bradley (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley resolved SPARK-24465.
---
   Resolution: Fixed
 Assignee: Joseph K. Bradley
Fix Version/s: 2.3.1
   2.4.0

> LSHModel should support Structured Streaming for transform
> --
>
> Key: SPARK-24465
> URL: https://issues.apache.org/jira/browse/SPARK-24465
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Affects Versions: 2.4.0
>Reporter: Joseph K. Bradley
>Assignee: Joseph K. Bradley
>Priority: Major
> Fix For: 2.4.0, 2.3.1
>
>
> Locality Sensitive Hashing (LSH) Models (BucketedRandomProjectionLSHModel, 
> MinHashLSHModel) are not compatible with Structured Streaming (and I believe 
> are the final Transformers which are not compatible).  These do not work 
> because Spark SQL does not support nested types containing UDTs, but LSH .
> This task is to add unit tests for streaming (as in [SPARK-22644]) for 
> LSHModels after [SPARK-12878] has been fixed.






[jira] [Updated] (SPARK-24465) LSHModel should support Structured Streaming for transform

2018-06-22 Thread Joseph K. Bradley (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-24465:
--
Description: 
Locality Sensitive Hashing (LSH) Models (BucketedRandomProjectionLSHModel, 
MinHashLSHModel) are not compatible with Structured Streaming (and I believe 
are the final Transformers which are not compatible).  These do not work 
because Spark SQL does not support nested types containing UDTs, but LSH .

This task is to add unit tests for streaming (as in [SPARK-22644]) for 
LSHModels after [SPARK-12878] has been fixed.

  was:
Locality Sensitive Hashing (LSH) Models (BucketedRandomProjectionLSHModel, 
MinHashLSHModel) are not compatible with Structured Streaming (and I believe 
are the final Transformers which are not compatible).  These do not work 
because Spark SQL does not support nested types containing UDTs; see 
[SPARK-12878].

This task is to add unit tests for streaming (as in [SPARK-22644]) for 
LSHModels after [SPARK-12878] has been fixed.


> LSHModel should support Structured Streaming for transform
> --
>
> Key: SPARK-24465
> URL: https://issues.apache.org/jira/browse/SPARK-24465
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Affects Versions: 2.4.0
>Reporter: Joseph K. Bradley
>Priority: Major
>
> Locality Sensitive Hashing (LSH) Models (BucketedRandomProjectionLSHModel, 
> MinHashLSHModel) are not compatible with Structured Streaming (and I believe 
> are the final Transformers which are not compatible).  These do not work 
> because Spark SQL does not support nested types containing UDTs, but LSH .
> This task is to add unit tests for streaming (as in [SPARK-22644]) for 
> LSHModels after [SPARK-12878] has been fixed.






[jira] [Commented] (SPARK-24465) LSHModel should support Structured Streaming for transform

2018-06-22 Thread Joseph K. Bradley (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16520671#comment-16520671
 ] 

Joseph K. Bradley commented on SPARK-24465:
---

You're right; I did not read [SPARK-12878] carefully enough.  I'll update this 
JIRA's description to be more specific.

> LSHModel should support Structured Streaming for transform
> --
>
> Key: SPARK-24465
> URL: https://issues.apache.org/jira/browse/SPARK-24465
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Affects Versions: 2.4.0
>Reporter: Joseph K. Bradley
>Priority: Major
>
> Locality Sensitive Hashing (LSH) Models (BucketedRandomProjectionLSHModel, 
> MinHashLSHModel) are not compatible with Structured Streaming (and I believe 
> are the final Transformers which are not compatible).  These do not work 
> because Spark SQL does not support nested types containing UDTs; see 
> [SPARK-12878].
> This task is to add unit tests for streaming (as in [SPARK-22644]) for 
> LSHModels after [SPARK-12878] has been fixed.






[jira] [Updated] (SPARK-12878) Dataframe fails with nested User Defined Types

2018-06-22 Thread Joseph K. Bradley (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-12878?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-12878:
--
Description: 
Spark 1.6.0 crashes when using nested User Defined Types in a Dataframe. 
In version 1.5.2 the code below worked just fine:

{code}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.expressions.GenericMutableRow
import org.apache.spark.sql.types._

@SQLUserDefinedType(udt = classOf[AUDT])
case class A(list:Seq[B])

class AUDT extends UserDefinedType[A] {
  override def sqlType: DataType = StructType(Seq(StructField("list", 
ArrayType(BUDT, containsNull = false), nullable = true)))
  override def userClass: Class[A] = classOf[A]
  override def serialize(obj: Any): Any = obj match {
case A(list) =>
  val row = new GenericMutableRow(1)
  row.update(0, new GenericArrayData(list.map(_.asInstanceOf[Any]).toArray))
  row
  }

  override def deserialize(datum: Any): A = {
datum match {
  case row: InternalRow => new A(row.getArray(0).toArray(BUDT).toSeq)
}
  }
}

object AUDT extends AUDT

@SQLUserDefinedType(udt = classOf[BUDT])
case class B(text:Int)

class BUDT extends UserDefinedType[B] {
  override def sqlType: DataType = StructType(Seq(StructField("num", 
IntegerType, nullable = false)))
  override def userClass: Class[B] = classOf[B]
  override def serialize(obj: Any): Any = obj match {
case B(text) =>
  val row = new GenericMutableRow(1)
  row.setInt(0, text)
  row
  }

  override def deserialize(datum: Any): B = {
datum match {  case row: InternalRow => new B(row.getInt(0))  }
  }
}

object BUDT extends BUDT

object Test {
  def main(args:Array[String]) = {

val col = Seq(new A(Seq(new B(1), new B(2))),
  new A(Seq(new B(3), new B(4))))

val sc = new SparkContext(new 
SparkConf().setMaster("local[1]").setAppName("TestSpark"))
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._

val df = sc.parallelize(1 to 2 zip col).toDF("id","b")
df.select("b").show()
df.collect().foreach(println)
  }
}
{code}

In the new version (1.6.0) I needed to include the following import:

`import org.apache.spark.sql.catalyst.expressions.GenericMutableRow`

However, Spark crashes in runtime:

{code}
16/01/18 14:36:22 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
java.lang.ClassCastException: scala.runtime.BoxedUnit cannot be cast to 
org.apache.spark.sql.catalyst.InternalRow
at 
org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow$class.getStruct(rows.scala:51)
at 
org.apache.spark.sql.catalyst.expressions.GenericMutableRow.getStruct(rows.scala:248)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown
 Source)
at 
org.apache.spark.sql.execution.Project$$anonfun$1$$anonfun$apply$1.apply(basicOperators.scala:51)
at 
org.apache.spark.sql.execution.Project$$anonfun$1$$anonfun$apply$1.apply(basicOperators.scala:49)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$$anon$10.next(Iterator.scala:312)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at 
scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
at 
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
at 
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
at scala.collection.AbstractIterator.to(Iterator.scala:1157)
at 
scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
at 
scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$5.apply(SparkPlan.scala:212)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$5.apply(SparkPlan.scala:212)
at 
org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1858)
at 
org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1858)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java

[jira] [Commented] (SPARK-19498) Discussion: Making MLlib APIs extensible for 3rd party libraries

2018-06-22 Thread Joseph K. Bradley (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-19498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16520666#comment-16520666
 ] 

Joseph K. Bradley commented on SPARK-19498:
---

Sure, comments are welcome!  Or links to JIRAs, whichever are easier.

> Discussion: Making MLlib APIs extensible for 3rd party libraries
> 
>
> Key: SPARK-19498
> URL: https://issues.apache.org/jira/browse/SPARK-19498
> Project: Spark
>  Issue Type: Brainstorming
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: Joseph K. Bradley
>Priority: Critical
>
> Per the recent discussion on the dev list, this JIRA is for discussing how we 
> can make MLlib DataFrame-based APIs more extensible, especially for the 
> purpose of writing 3rd-party libraries with APIs extended from the MLlib APIs 
> (for custom Transformers, Estimators, etc.).
> * For people who have written such libraries, what issues have you run into?
> * What APIs are not public or extensible enough?  Do they require changes 
> before being made more public?
> * Are APIs for non-Scala languages such as Java and Python friendly or 
> extensive enough?
> The easy answer is to make everything public, but that would be terrible of 
> course in the long-term.  Let's discuss what is needed and how we can present 
> stable, sufficient, and easy-to-use APIs for 3rd-party developers.






[jira] [Commented] (SPARK-22666) Spark datasource for image format

2018-06-22 Thread Jayesh lalwani (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-22666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16520657#comment-16520657
 ] 

Jayesh lalwani commented on SPARK-22666:


I'll try to take this on

> Spark datasource for image format
> -
>
> Key: SPARK-22666
> URL: https://issues.apache.org/jira/browse/SPARK-22666
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: Timothy Hunter
>Priority: Major
>
> The current API for the new image format is implemented as a standalone 
> feature, in order to make it reside within the mllib package. As discussed in 
> SPARK-21866, users should be able to load images through the more common 
> spark source reader interface.
> This ticket is concerned with adding image reading support in the spark 
> source API, through either of the following interfaces:
>  - {{spark.read.format("image")...}}
>  - {{spark.read.image}}
> The output is a dataframe that contains images (and the file names for 
> example), following the semantics discussed already in SPARK-21866.
> A few technical notes:
> * since the functionality is implemented in {{mllib}}, calling this function 
> may fail at runtime if users have not imported the {{spark-mllib}} dependency
> * How to deal with very flat directories? It is common to have millions of 
> files in a single "directory" (like in S3), which seems to have caused some 
> issues to some users. If this issue is too complex to handle in this ticket, 
> it can be dealt with separately.
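For concreteness, a sketch of what the proposed reader call could look like for end users (the column layout follows the existing ImageSchema from SPARK-21866; the path is a placeholder):
{code:scala}
// Assumes a SparkSession named spark and a directory of images at /data/images.
val images = spark.read.format("image").load("/data/images")
images.printSchema()
images.select("image.origin", "image.width", "image.height").show(truncate = false)
{code}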






[jira] [Issue Comment Deleted] (SPARK-17025) Cannot persist PySpark ML Pipeline model that includes custom Transformer

2018-06-22 Thread Joseph K. Bradley (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-17025?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-17025:
--
Comment: was deleted

(was: Thank you for your e-mail. I am on business travel until Monday 2nd July, 
so expect delays in my response.

Many thanks,

Pete

Dr Peter Knight
Sr Staff Analytics Engineer| UK Data Science
GE Aviation
AutoExtReply
)

> Cannot persist PySpark ML Pipeline model that includes custom Transformer
> -
>
> Key: SPARK-17025
> URL: https://issues.apache.org/jira/browse/SPARK-17025
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, PySpark
>Affects Versions: 2.0.0
>Reporter: Nicholas Chammas
>Assignee: Ajay Saini
>Priority: Minor
> Fix For: 2.3.0
>
>
> Following the example in [this Databricks blog 
> post|https://databricks.com/blog/2016/05/31/apache-spark-2-0-preview-machine-learning-model-persistence.html]
>  under "Python tuning", I'm trying to save an ML Pipeline model.
> This pipeline, however, includes a custom transformer. When I try to save the 
> model, the operation fails because the custom transformer doesn't have a 
> {{_to_java}} attribute.
> {code}
> Traceback (most recent call last):
>   File ".../file.py", line 56, in 
> model.bestModel.save('model')
>   File 
> "/usr/local/Cellar/apache-spark/2.0.0/libexec/python/lib/pyspark.zip/pyspark/ml/pipeline.py",
>  line 222, in save
>   File 
> "/usr/local/Cellar/apache-spark/2.0.0/libexec/python/lib/pyspark.zip/pyspark/ml/pipeline.py",
>  line 217, in write
>   File 
> "/usr/local/Cellar/apache-spark/2.0.0/libexec/python/lib/pyspark.zip/pyspark/ml/util.py",
>  line 93, in __init__
>   File 
> "/usr/local/Cellar/apache-spark/2.0.0/libexec/python/lib/pyspark.zip/pyspark/ml/pipeline.py",
>  line 254, in _to_java
> AttributeError: 'PeoplePairFeaturizer' object has no attribute '_to_java'
> {code}
> Looking at the source code for 
> [ml/base.py|https://github.com/apache/spark/blob/acaf2a81ad5238fd1bc81e7be2c328f40c07e755/python/pyspark/ml/base.py],
>  I see that not even the base Transformer class has such an attribute.
> I'm assuming this is missing functionality that is intended to be patched up 
> (i.e. [like 
> this|https://github.com/apache/spark/blob/acaf2a81ad5238fd1bc81e7be2c328f40c07e755/python/pyspark/ml/classification.py#L1421-L1433]).
> I'm not sure if there is an existing JIRA for this (my searches didn't turn 
> up clear results).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-17025) Cannot persist PySpark ML Pipeline model that includes custom Transformer

2018-06-22 Thread Joseph K. Bradley (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-17025?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley resolved SPARK-17025.
---
   Resolution: Fixed
Fix Version/s: 2.3.0

Fixed by linked JIRAs

> Cannot persist PySpark ML Pipeline model that includes custom Transformer
> -
>
> Key: SPARK-17025
> URL: https://issues.apache.org/jira/browse/SPARK-17025
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, PySpark
>Affects Versions: 2.0.0
>Reporter: Nicholas Chammas
>Priority: Minor
> Fix For: 2.3.0
>
>
> Following the example in [this Databricks blog 
> post|https://databricks.com/blog/2016/05/31/apache-spark-2-0-preview-machine-learning-model-persistence.html]
>  under "Python tuning", I'm trying to save an ML Pipeline model.
> This pipeline, however, includes a custom transformer. When I try to save the 
> model, the operation fails because the custom transformer doesn't have a 
> {{_to_java}} attribute.
> {code}
> Traceback (most recent call last):
>   File ".../file.py", line 56, in 
> model.bestModel.save('model')
>   File 
> "/usr/local/Cellar/apache-spark/2.0.0/libexec/python/lib/pyspark.zip/pyspark/ml/pipeline.py",
>  line 222, in save
>   File 
> "/usr/local/Cellar/apache-spark/2.0.0/libexec/python/lib/pyspark.zip/pyspark/ml/pipeline.py",
>  line 217, in write
>   File 
> "/usr/local/Cellar/apache-spark/2.0.0/libexec/python/lib/pyspark.zip/pyspark/ml/util.py",
>  line 93, in __init__
>   File 
> "/usr/local/Cellar/apache-spark/2.0.0/libexec/python/lib/pyspark.zip/pyspark/ml/pipeline.py",
>  line 254, in _to_java
> AttributeError: 'PeoplePairFeaturizer' object has no attribute '_to_java'
> {code}
> Looking at the source code for 
> [ml/base.py|https://github.com/apache/spark/blob/acaf2a81ad5238fd1bc81e7be2c328f40c07e755/python/pyspark/ml/base.py],
>  I see that not even the base Transformer class has such an attribute.
> I'm assuming this is missing functionality that is intended to be patched up 
> (i.e. [like 
> this|https://github.com/apache/spark/blob/acaf2a81ad5238fd1bc81e7be2c328f40c07e755/python/pyspark/ml/classification.py#L1421-L1433]).
> I'm not sure if there is an existing JIRA for this (my searches didn't turn 
> up clear results).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17025) Cannot persist PySpark ML Pipeline model that includes custom Transformer

2018-06-22 Thread Joseph K. Bradley (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-17025?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley reassigned SPARK-17025:
-

Assignee: Ajay Saini

> Cannot persist PySpark ML Pipeline model that includes custom Transformer
> -
>
> Key: SPARK-17025
> URL: https://issues.apache.org/jira/browse/SPARK-17025
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, PySpark
>Affects Versions: 2.0.0
>Reporter: Nicholas Chammas
>Assignee: Ajay Saini
>Priority: Minor
> Fix For: 2.3.0
>
>
> Following the example in [this Databricks blog 
> post|https://databricks.com/blog/2016/05/31/apache-spark-2-0-preview-machine-learning-model-persistence.html]
>  under "Python tuning", I'm trying to save an ML Pipeline model.
> This pipeline, however, includes a custom transformer. When I try to save the 
> model, the operation fails because the custom transformer doesn't have a 
> {{_to_java}} attribute.
> {code}
> Traceback (most recent call last):
>   File ".../file.py", line 56, in 
> model.bestModel.save('model')
>   File 
> "/usr/local/Cellar/apache-spark/2.0.0/libexec/python/lib/pyspark.zip/pyspark/ml/pipeline.py",
>  line 222, in save
>   File 
> "/usr/local/Cellar/apache-spark/2.0.0/libexec/python/lib/pyspark.zip/pyspark/ml/pipeline.py",
>  line 217, in write
>   File 
> "/usr/local/Cellar/apache-spark/2.0.0/libexec/python/lib/pyspark.zip/pyspark/ml/util.py",
>  line 93, in __init__
>   File 
> "/usr/local/Cellar/apache-spark/2.0.0/libexec/python/lib/pyspark.zip/pyspark/ml/pipeline.py",
>  line 254, in _to_java
> AttributeError: 'PeoplePairFeaturizer' object has no attribute '_to_java'
> {code}
> Looking at the source code for 
> [ml/base.py|https://github.com/apache/spark/blob/acaf2a81ad5238fd1bc81e7be2c328f40c07e755/python/pyspark/ml/base.py],
>  I see that not even the base Transformer class has such an attribute.
> I'm assuming this is missing functionality that is intended to be patched up 
> (i.e. [like 
> this|https://github.com/apache/spark/blob/acaf2a81ad5238fd1bc81e7be2c328f40c07e755/python/pyspark/ml/classification.py#L1421-L1433]).
> I'm not sure if there is an existing JIRA for this (my searches didn't turn 
> up clear results).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24632) Allow 3rd-party libraries to use pyspark.ml abstractions for Java wrappers for persistence

2018-06-22 Thread Joseph K. Bradley (JIRA)
Joseph K. Bradley created SPARK-24632:
-

 Summary: Allow 3rd-party libraries to use pyspark.ml abstractions 
for Java wrappers for persistence
 Key: SPARK-24632
 URL: https://issues.apache.org/jira/browse/SPARK-24632
 Project: Spark
  Issue Type: Improvement
  Components: ML, PySpark
Affects Versions: 2.4.0
Reporter: Joseph K. Bradley


This is a follow-up for [SPARK-17025], which allowed users to implement Python 
PipelineStages in 3rd-party libraries, include them in Pipelines, and use 
Pipeline persistence.  This task is to make it easier for 3rd-party libraries 
to have PipelineStages written in Java and then to use pyspark.ml abstractions 
to create wrappers around those Java classes.  This is currently possible, 
except that users hit bugs around persistence.

One fix we'll need is an overridable method for converting between Python and 
Java classpaths. See 
https://github.com/apache/spark/blob/b56e9c613fb345472da3db1a567ee129621f6bf3/python/pyspark/ml/util.py#L284

One unusual thing for this task will be to write unit tests which test a custom 
PipelineStage written outside of the pyspark package.
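
To make the pain point concrete, here is a rough sketch of what such a 
3rd-party Java-backed wrapper looks like today. The Java class and package 
names are hypothetical; construction and transform() work, while save()/load() 
is where the persistence bugs mentioned above appear, since the default 
Python-to-Java class path conversion assumes classes under the pyspark / 
org.apache.spark namespaces:

{code}
from pyspark.ml.util import JavaMLReadable, JavaMLWritable
from pyspark.ml.wrapper import JavaTransformer


class MyCompanyTokenizer(JavaTransformer, JavaMLReadable, JavaMLWritable):
    """Hypothetical 3rd-party wrapper around a JVM Transformer that lives
    outside org.apache.spark; the Java class name below is an assumption."""

    def __init__(self):
        super(MyCompanyTokenizer, self).__init__()
        self._java_obj = self._new_java_obj(
            "com.example.ml.MyCompanyTokenizer", self.uid)

# transform() delegates to the JVM object as usual; persistence is where the
# classpath-mapping issue tracked by this ticket shows up.
{code}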



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17025) Cannot persist PySpark ML Pipeline model that includes custom Transformer

2018-06-22 Thread Peter Knight (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-17025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16520648#comment-16520648
 ] 

Peter Knight commented on SPARK-17025:
--

Thank you for your e-mail. I am on business travel until Monday 2nd July so 
expect delays in my response.

Many thanks,

Pete

Dr Peter Knight
Sr Staff Analytics Engineer| UK Data Science
GE Aviation
AutoExtReply


> Cannot persist PySpark ML Pipeline model that includes custom Transformer
> -
>
> Key: SPARK-17025
> URL: https://issues.apache.org/jira/browse/SPARK-17025
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, PySpark
>Affects Versions: 2.0.0
>Reporter: Nicholas Chammas
>Priority: Minor
>
> Following the example in [this Databricks blog 
> post|https://databricks.com/blog/2016/05/31/apache-spark-2-0-preview-machine-learning-model-persistence.html]
>  under "Python tuning", I'm trying to save an ML Pipeline model.
> This pipeline, however, includes a custom transformer. When I try to save the 
> model, the operation fails because the custom transformer doesn't have a 
> {{_to_java}} attribute.
> {code}
> Traceback (most recent call last):
>   File ".../file.py", line 56, in 
> model.bestModel.save('model')
>   File 
> "/usr/local/Cellar/apache-spark/2.0.0/libexec/python/lib/pyspark.zip/pyspark/ml/pipeline.py",
>  line 222, in save
>   File 
> "/usr/local/Cellar/apache-spark/2.0.0/libexec/python/lib/pyspark.zip/pyspark/ml/pipeline.py",
>  line 217, in write
>   File 
> "/usr/local/Cellar/apache-spark/2.0.0/libexec/python/lib/pyspark.zip/pyspark/ml/util.py",
>  line 93, in __init__
>   File 
> "/usr/local/Cellar/apache-spark/2.0.0/libexec/python/lib/pyspark.zip/pyspark/ml/pipeline.py",
>  line 254, in _to_java
> AttributeError: 'PeoplePairFeaturizer' object has no attribute '_to_java'
> {code}
> Looking at the source code for 
> [ml/base.py|https://github.com/apache/spark/blob/acaf2a81ad5238fd1bc81e7be2c328f40c07e755/python/pyspark/ml/base.py],
>  I see that not even the base Transformer class has such an attribute.
> I'm assuming this is missing functionality that is intended to be patched up 
> (i.e. [like 
> this|https://github.com/apache/spark/blob/acaf2a81ad5238fd1bc81e7be2c328f40c07e755/python/pyspark/ml/classification.py#L1421-L1433]).
> I'm not sure if there is an existing JIRA for this (my searches didn't turn 
> up clear results).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17025) Cannot persist PySpark ML Pipeline model that includes custom Transformer

2018-06-22 Thread Joseph K. Bradley (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-17025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16520646#comment-16520646
 ] 

Joseph K. Bradley commented on SPARK-17025:
---

We've tested it with Python-only implementations, and it works.  You end up 
hitting problems if you use the (sort of private) Python abstractions for Java 
wrappers like JavaMLWritable.  I'll create a follow-up issue for that and close 
this one as fixed.
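
For reference, a minimal sketch of a Python-only custom Transformer that 
persists inside a Pipeline via the DefaultParams read/write mixins (assuming 
Spark 2.3.0+; class, column, and path names are illustrative):

{code}
from pyspark.ml import Pipeline, PipelineModel, Transformer
from pyspark.ml.param.shared import HasInputCol, HasOutputCol
from pyspark.ml.util import DefaultParamsReadable, DefaultParamsWritable
from pyspark.sql import SparkSession, functions as F


class UpperCaser(Transformer, HasInputCol, HasOutputCol,
                 DefaultParamsReadable, DefaultParamsWritable):
    """Illustrative Python-only stage: upper-cases inputCol into outputCol."""

    def __init__(self, inputCol="input", outputCol="output"):
        super(UpperCaser, self).__init__()
        self._set(inputCol=inputCol, outputCol=outputCol)

    def _transform(self, df):
        return df.withColumn(self.getOutputCol(),
                             F.upper(F.col(self.getInputCol())))


spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("ada",), ("grace",)], ["name"])

model = Pipeline(stages=[UpperCaser(inputCol="name",
                                    outputCol="name_upper")]).fit(df)
model.save("/tmp/pipeline_with_python_only_stage")          # illustrative path
reloaded = PipelineModel.load("/tmp/pipeline_with_python_only_stage")
# Note: in a real library the class should live in an importable module (not
# an interactive __main__) so that load() can re-import it in a new session.
{code}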

> Cannot persist PySpark ML Pipeline model that includes custom Transformer
> -
>
> Key: SPARK-17025
> URL: https://issues.apache.org/jira/browse/SPARK-17025
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, PySpark
>Affects Versions: 2.0.0
>Reporter: Nicholas Chammas
>Priority: Minor
>
> Following the example in [this Databricks blog 
> post|https://databricks.com/blog/2016/05/31/apache-spark-2-0-preview-machine-learning-model-persistence.html]
>  under "Python tuning", I'm trying to save an ML Pipeline model.
> This pipeline, however, includes a custom transformer. When I try to save the 
> model, the operation fails because the custom transformer doesn't have a 
> {{_to_java}} attribute.
> {code}
> Traceback (most recent call last):
>   File ".../file.py", line 56, in 
> model.bestModel.save('model')
>   File 
> "/usr/local/Cellar/apache-spark/2.0.0/libexec/python/lib/pyspark.zip/pyspark/ml/pipeline.py",
>  line 222, in save
>   File 
> "/usr/local/Cellar/apache-spark/2.0.0/libexec/python/lib/pyspark.zip/pyspark/ml/pipeline.py",
>  line 217, in write
>   File 
> "/usr/local/Cellar/apache-spark/2.0.0/libexec/python/lib/pyspark.zip/pyspark/ml/util.py",
>  line 93, in __init__
>   File 
> "/usr/local/Cellar/apache-spark/2.0.0/libexec/python/lib/pyspark.zip/pyspark/ml/pipeline.py",
>  line 254, in _to_java
> AttributeError: 'PeoplePairFeaturizer' object has no attribute '_to_java'
> {code}
> Looking at the source code for 
> [ml/base.py|https://github.com/apache/spark/blob/acaf2a81ad5238fd1bc81e7be2c328f40c07e755/python/pyspark/ml/base.py],
>  I see that not even the base Transformer class has such an attribute.
> I'm assuming this is missing functionality that is intended to be patched up 
> (i.e. [like 
> this|https://github.com/apache/spark/blob/acaf2a81ad5238fd1bc81e7be2c328f40c07e755/python/pyspark/ml/classification.py#L1421-L1433]).
> I'm not sure if there is an existing JIRA for this (my searches didn't turn 
> up clear results).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4591) Algorithm/model parity for spark.ml (Scala)

2018-06-22 Thread Joseph K. Bradley (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-4591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16520635#comment-16520635
 ] 

Joseph K. Bradley commented on SPARK-4591:
--

There are still a few contained tasks which are incomplete.  I'd like to leave 
this open for now.

> Algorithm/model parity for spark.ml (Scala)
> ---
>
> Key: SPARK-4591
> URL: https://issues.apache.org/jira/browse/SPARK-4591
> Project: Spark
>  Issue Type: Umbrella
>  Components: ML
>Reporter: Xiangrui Meng
>Priority: Critical
>
> This is an umbrella JIRA for porting spark.mllib implementations to use the 
> DataFrame-based API defined under spark.ml.  We want to achieve critical 
> feature parity for the next release.
> h3. Instructions for 3 subtask types
> *Review tasks*: detailed review of a subpackage to identify feature gaps 
> between spark.mllib and spark.ml.
> * Should be listed as a subtask of this umbrella.
> * Review subtasks cover major algorithm groups.  To pick up a review subtask, 
> please:
> ** Comment that you are working on it.
> ** Compare the public APIs of spark.ml vs. spark.mllib.
> ** Comment on all missing items within spark.ml: algorithms, models, methods, 
> features, etc.
> ** Check for existing JIRAs covering those items.  If there is no existing 
> JIRA, create one, and link it to your comment.
> *Critical tasks*: higher priority missing features which are required for 
> this umbrella JIRA.
> * Should be linked as "requires" links.
> *Other tasks*: lower priority missing features which can be completed after 
> the critical tasks.
> * Should be linked as "contains" links.
> h4. Excluded items
> This does *not* include:
> * Python: We can compare Scala vs. Python in spark.ml itself.
> * Moving linalg to spark.ml: [SPARK-13944]
> * Streaming ML: Requires stabilizing some internal APIs of structured 
> streaming first
> h3. TODO list
> *Critical issues*
> * [SPARK-14501]: Frequent Pattern Mining
> * [SPARK-14709]: linear SVM
> * [SPARK-15784]: Power Iteration Clustering (PIC)
> *Lower priority issues*
> * Missing methods within algorithms (see Issue Links below)
> * evaluation submodule
> * stat submodule (should probably be covered in DataFrames)
> * Developer-facing submodules:
> ** optimization (including [SPARK-17136])
> ** random, rdd
> ** util
> *To be prioritized*
> * single-instance prediction: [SPARK-10413]
> * pmml [SPARK-11171]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11107) spark.ml should support more input column types: umbrella

2018-06-22 Thread Joseph K. Bradley (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-11107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16520634#comment-16520634
 ] 

Joseph K. Bradley commented on SPARK-11107:
---

There are still lots of Transformers and Estimators which should support more 
types, but I'm OK closing this since I don't have time to work on it.

> spark.ml should support more input column types: umbrella
> -
>
> Key: SPARK-11107
> URL: https://issues.apache.org/jira/browse/SPARK-11107
> Project: Spark
>  Issue Type: Umbrella
>  Components: ML
>Reporter: Joseph K. Bradley
>Priority: Major
>
> This is an umbrella for expanding the set of data types which spark.ml 
> Pipeline stages can take.  This should not involve breaking APIs, but merely 
> involve slight changes such as supporting all Numeric types instead of just 
> Double.
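
Until individual stages accept wider numeric types, the usual workaround is an 
explicit cast up front; a small sketch with illustrative column names:

{code}
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, 10), (2, 20)], ["id", "count"])  # integer cols

# Cast up front so a Double-only Pipeline stage accepts the column.
prepared = df.withColumn("count", F.col("count").cast("double"))
prepared.printSchema()
{code}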



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-11107) spark.ml should support more input column types: umbrella

2018-06-22 Thread Joseph K. Bradley (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-11107?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley resolved SPARK-11107.
---
Resolution: Done

> spark.ml should support more input column types: umbrella
> -
>
> Key: SPARK-11107
> URL: https://issues.apache.org/jira/browse/SPARK-11107
> Project: Spark
>  Issue Type: Umbrella
>  Components: ML
>Reporter: Joseph K. Bradley
>Priority: Major
>
> This is an umbrella for expanding the set of data types which spark.ml 
> Pipeline stages can take.  This should not involve breaking APIs, but merely 
> involve slight changes such as supporting all Numeric types instead of just 
> Double.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24372) Create script for preparing RCs

2018-06-22 Thread Thomas Graves (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24372?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves resolved SPARK-24372.
---
   Resolution: Fixed
 Assignee: Marcelo Vanzin
Fix Version/s: 2.4.0

> Create script for preparing RCs
> ---
>
> Key: SPARK-24372
> URL: https://issues.apache.org/jira/browse/SPARK-24372
> Project: Spark
>  Issue Type: New Feature
>  Components: Project Infra
>Affects Versions: 2.4.0
>Reporter: Marcelo Vanzin
>Assignee: Marcelo Vanzin
>Priority: Major
> Fix For: 2.4.0
>
>
> Currently, when preparing RCs, the RM has to invoke many scripts manually, 
> make sure that is being done in the correct environment, and set all the 
> correct environment variables, which differ from one script to the other.
> It will be much easier for RMs if all that was automated as much as possible.
> I'm working on something like this as part of releasing 2.3.1, and plan to 
> send my scripts for review after the release is done (i.e. after I make sure 
> the scripts are working properly).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24518) Using Hadoop credential provider API to store password

2018-06-22 Thread Marcelo Vanzin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24518?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-24518.

   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 21548
[https://github.com/apache/spark/pull/21548]

> Using Hadoop credential provider API to store password
> --
>
> Key: SPARK-24518
> URL: https://issues.apache.org/jira/browse/SPARK-24518
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: Saisai Shao
>Assignee: Saisai Shao
>Priority: Minor
> Fix For: 2.4.0
>
>
> Currently Spark configures passwords in plaintext, for example by putting 
> them in the configuration file or passing them as launch arguments. Sometimes 
> such configurations, like the SSL password, are set by the cluster admin and 
> should not be seen by users, but today these passwords are world readable to 
> all users.
> The Hadoop credential provider API supports storing passwords securely, and 
> Spark could read them securely from it, so this proposes adding support for 
> using the credential provider API to get passwords.
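
A sketch of the intended usage pattern, assuming the standard Hadoop 
provider-path key; the alias, paths, and the exact set of Spark passwords that 
consult the provider are assumptions here, so see the linked pull request for 
the authoritative details:

{code}
# 1) Store the secret with the Hadoop credential CLI (alias and path are
#    assumptions for illustration):
#      hadoop credential create spark.ssl.keyStorePassword \
#          -provider jceks://file/etc/security/spark.jceks
# 2) Point Spark's Hadoop configuration at the provider instead of putting
#    the password into plaintext configuration:
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .config("spark.hadoop.hadoop.security.credential.provider.path",
                 "jceks://file/etc/security/spark.jceks")
         .getOrCreate())
{code}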



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24518) Using Hadoop credential provider API to store password

2018-06-22 Thread Marcelo Vanzin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24518?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin reassigned SPARK-24518:
--

Assignee: Saisai Shao

> Using Hadoop credential provider API to store password
> --
>
> Key: SPARK-24518
> URL: https://issues.apache.org/jira/browse/SPARK-24518
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: Saisai Shao
>Assignee: Saisai Shao
>Priority: Minor
> Fix For: 2.4.0
>
>
> Currently Spark configures passwords in plaintext, for example by putting 
> them in the configuration file or passing them as launch arguments. Sometimes 
> such configurations, like the SSL password, are set by the cluster admin and 
> should not be seen by users, but today these passwords are world readable to 
> all users.
> The Hadoop credential provider API supports storing passwords securely, and 
> Spark could read them securely from it, so this proposes adding support for 
> using the credential provider API to get passwords.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24130) Data Source V2: Join Push Down

2018-06-22 Thread Dilip Biswal (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16520587#comment-16520587
 ] 

Dilip Biswal commented on SPARK-24130:
--

[~Shurap1] We are currently waiting for feedback from the community on how to 
proceed. I think we need a V2 implementation of JDBC datasource before we can 
proceed on the pushdown.

> Data Source V2: Join Push Down
> --
>
> Key: SPARK-24130
> URL: https://issues.apache.org/jira/browse/SPARK-24130
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Jia Li
>Priority: Major
> Attachments: Data Source V2 Join Push Down.pdf
>
>
> Spark applications often directly query external data sources such as 
> relational databases, or files. Spark provides Data Sources APIs for 
> accessing structured data through Spark SQL. Data Sources APIs in both V1 and 
> V2 support optimizations such as Filter push down and Column pruning which 
> are subset of the functionality that can be pushed down to some data sources. 
> We’re proposing to extend Data Sources APIs with join push down (JPD). Join 
> push down significantly improves query performance by reducing the amount of 
> data transfer and exploiting the capabilities of the data sources such as 
> index access.
> Join push down design document is available 
> [here|https://docs.google.com/document/d/1k-kRadTcUbxVfUQwqBbIXs_yPZMxh18-e-cz77O_TaE/edit?usp=sharing].
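
Until join push down lands, the common manual workaround is to push the join 
yourself by handing the JDBC source a subquery; a sketch with illustrative 
connection details:

{code}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Run the join inside the database by giving the JDBC source a subquery.
pushed_join = """(SELECT o.id, o.amount, c.name
                  FROM orders o JOIN customers c ON o.customer_id = c.id) t"""

df = (spark.read.format("jdbc")
      .option("url", "jdbc:postgresql://dbhost:5432/sales")   # illustrative
      .option("dbtable", pushed_join)
      .option("user", "report")
      .option("password", "...")
      .load())
{code}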



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-24130) Data Source V2: Join Push Down

2018-06-22 Thread Dilip Biswal (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24130?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dilip Biswal updated SPARK-24130:
-
Comment: was deleted

(was: [~Shurap1] We are currently waiting for feedback from the community on 
how to proceed. I think we need a V2 implementation of JDBC datasource before 
we can proceed on the pushdown.)

> Data Source V2: Join Push Down
> --
>
> Key: SPARK-24130
> URL: https://issues.apache.org/jira/browse/SPARK-24130
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Jia Li
>Priority: Major
> Attachments: Data Source V2 Join Push Down.pdf
>
>
> Spark applications often directly query external data sources such as 
> relational databases, or files. Spark provides Data Sources APIs for 
> accessing structured data through Spark SQL. Data Sources APIs in both V1 and 
> V2 support optimizations such as Filter push down and Column pruning which 
> are subset of the functionality that can be pushed down to some data sources. 
> We’re proposing to extend Data Sources APIs with join push down (JPD). Join 
> push down significantly improves query performance by reducing the amount of 
> data transfer and exploiting the capabilities of the data sources such as 
> index access.
> Join push down design document is available 
> [here|https://docs.google.com/document/d/1k-kRadTcUbxVfUQwqBbIXs_yPZMxh18-e-cz77O_TaE/edit?usp=sharing].



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24130) Data Source V2: Join Push Down

2018-06-22 Thread Dilip Biswal (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16520581#comment-16520581
 ] 

Dilip Biswal commented on SPARK-24130:
--

[~Shurap1] We are currently waiting for feedback from the community on how to 
proceed. I think we need a V2 implementation of JDBC datasource before we can 
proceed on the pushdown.

> Data Source V2: Join Push Down
> --
>
> Key: SPARK-24130
> URL: https://issues.apache.org/jira/browse/SPARK-24130
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Jia Li
>Priority: Major
> Attachments: Data Source V2 Join Push Down.pdf
>
>
> Spark applications often directly query external data sources such as 
> relational databases, or files. Spark provides Data Sources APIs for 
> accessing structured data through Spark SQL. Data Sources APIs in both V1 and 
> V2 support optimizations such as Filter push down and Column pruning which 
> are subset of the functionality that can be pushed down to some data sources. 
> We’re proposing to extend Data Sources APIs with join push down (JPD). Join 
> push down significantly improves query performance by reducing the amount of 
> data transfer and exploiting the capabilities of the data sources such as 
> index access.
> Join push down design document is available 
> [here|https://docs.google.com/document/d/1k-kRadTcUbxVfUQwqBbIXs_yPZMxh18-e-cz77O_TaE/edit?usp=sharing].



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24611) Clean up OutputCommitCoordinator

2018-06-22 Thread Marcelo Vanzin (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16520578#comment-16520578
 ] 

Marcelo Vanzin commented on SPARK-24611:


One more: adjust the test so that it ensures that state is kept if multiple 
{{stageStart}} calls are made.

> Clean up OutputCommitCoordinator
> 
>
> Key: SPARK-24611
> URL: https://issues.apache.org/jira/browse/SPARK-24611
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Marcelo Vanzin
>Priority: Major
>
> This is a follow up to SPARK-24589, to address some issues brought up during 
> review of the change:
> - the DAGScheduler registers all stages with the coordinator, when at first 
> view only result stages need to. That would save memory in the driver.
> - the coordinator can track task IDs instead of the internal "TaskIdentifier" 
> type it uses; that would also save some memory, and also be more accurate.
> - {{TaskCommitDenied}} currently has a "job ID" when it's really a stage ID, 
> and it contains the task attempt number, when it should probably have the 
> task ID instead (like above).
> The latter is an API breakage (in a class tagged as developer API, but 
> still), and also affects data written to event logs.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-24631) Cannot up cast column from bigint to smallint as it may truncate

2018-06-22 Thread vaquar khan (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24631?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16520576#comment-16520576
 ] 

vaquar khan edited comment on SPARK-24631 at 6/22/18 4:43 PM:
--

Database type (MySQL, HBase, etc.), column descriptions (int, bigdecimal, 
etc.), and any other details will help to reproduce the issue.

What type of connection are you using (JDBC?), and which database and Spark 
versions?


was (Author: vaquar.k...@gmail.com):
Database type (MYSQL,Hbase etc ), column descriptions (int,bigdecimal etc ) all 
possible detail will help to reproduce issue

> Cannot up cast column from bigint to smallint as it may truncate
> 
>
> Key: SPARK-24631
> URL: https://issues.apache.org/jira/browse/SPARK-24631
> Project: Spark
>  Issue Type: New JIRA Project
>  Components: Spark Core, Spark Submit
>Affects Versions: 2.2.1
>Reporter: Sivakumar
>Priority: Major
>
> Getting the below error when executing the simple select query,
> Sample:
> Table Description:
> name: String, id: BigInt
> val df=spark.sql("select name,id from testtable")
> ERROR: {color:#ff}Cannot up cast column "id" from bigint to smallint as 
> it may truncate.{color}
> I am not doing any transformation's, I am just trying to query a table ,But 
> still I am getting the error.
> I am getting this error only on production cluster and only for a single 
> table, other tables are running fine.
> + more data,
> val df=spark.sql("select* from table_name")
> I am just trying this query a table. But with other tables it is running fine.
> {color:#d04437}18/06/22 01:36:29 ERROR Driver1: [] [main] Exception occurred: 
> org.apache.spark.sql.AnalysisException: Cannot up cast `column_name` from 
> bigint to column_name#2525: smallint as it may truncate.{color}
> that specific column is having Bigint datatype, But there were other table's 
> that ran fine with Bigint columns.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24631) Cannot up cast column from bigint to smallint as it may truncate

2018-06-22 Thread vaquar khan (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24631?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16520576#comment-16520576
 ] 

vaquar khan commented on SPARK-24631:
-

Database type (MySQL, HBase, etc.), column descriptions (int, bigdecimal, 
etc.), and any other details will help to reproduce the issue.

> Cannot up cast column from bigint to smallint as it may truncate
> 
>
> Key: SPARK-24631
> URL: https://issues.apache.org/jira/browse/SPARK-24631
> Project: Spark
>  Issue Type: New JIRA Project
>  Components: Spark Core, Spark Submit
>Affects Versions: 2.2.1
>Reporter: Sivakumar
>Priority: Major
>
> Getting the below error when executing the simple select query,
> Sample:
> Table Description:
> name: String, id: BigInt
> val df=spark.sql("select name,id from testtable")
> ERROR: {color:#ff}Cannot up cast column "id" from bigint to smallint as 
> it may truncate.{color}
> I am not doing any transformation's, I am just trying to query a table ,But 
> still I am getting the error.
> I am getting this error only on production cluster and only for a single 
> table, other tables are running fine.
> + more data,
> val df=spark.sql("select* from table_name")
> I am just trying this query a table. But with other tables it is running fine.
> {color:#d04437}18/06/22 01:36:29 ERROR Driver1: [] [main] Exception occurred: 
> org.apache.spark.sql.AnalysisException: Cannot up cast `column_name` from 
> bigint to column_name#2525: smallint as it may truncate.{color}
> that specific column is having Bigint datatype, But there were other table's 
> that ran fine with Bigint columns.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24130) Data Source V2: Join Push Down

2018-06-22 Thread Jia Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24130?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jia Li updated SPARK-24130:
---
Attachment: Data Source V2 Join Push Down.pdf

> Data Source V2: Join Push Down
> --
>
> Key: SPARK-24130
> URL: https://issues.apache.org/jira/browse/SPARK-24130
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Jia Li
>Priority: Major
> Attachments: Data Source V2 Join Push Down.pdf
>
>
> Spark applications often directly query external data sources such as 
> relational databases, or files. Spark provides Data Sources APIs for 
> accessing structured data through Spark SQL. Data Sources APIs in both V1 and 
> V2 support optimizations such as Filter push down and Column pruning which 
> are subset of the functionality that can be pushed down to some data sources. 
> We’re proposing to extend Data Sources APIs with join push down (JPD). Join 
> push down significantly improves query performance by reducing the amount of 
> data transfer and exploiting the capabilities of the data sources such as 
> index access.
> Join push down design document is available 
> [here|https://docs.google.com/document/d/1k-kRadTcUbxVfUQwqBbIXs_yPZMxh18-e-cz77O_TaE/edit?usp=sharing].



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24631) Cannot up cast column from bigint to smallint as it may truncate

2018-06-22 Thread Sivakumar (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24631?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16520551#comment-16520551
 ] 

Sivakumar commented on SPARK-24631:
---

Updated with some additional data. I have tried the same in spark2-shell as 
well, but it throws the same error for that particular table.

> Cannot up cast column from bigint to smallint as it may truncate
> 
>
> Key: SPARK-24631
> URL: https://issues.apache.org/jira/browse/SPARK-24631
> Project: Spark
>  Issue Type: New JIRA Project
>  Components: Spark Core, Spark Submit
>Affects Versions: 2.2.1
>Reporter: Sivakumar
>Priority: Major
>
> Getting the below error when executing the simple select query,
> Sample:
> Table Description:
> name: String, id: BigInt
> val df=spark.sql("select name,id from testtable")
> ERROR: {color:#ff}Cannot up cast column "id" from bigint to smallint as 
> it may truncate.{color}
> I am not doing any transformation's, I am just trying to query a table ,But 
> still I am getting the error.
> I am getting this error only on production cluster and only for a single 
> table, other tables are running fine.
> + more data,
> val df=spark.sql("select* from table_name")
> I am just trying this query a table. But with other tables it is running fine.
> {color:#d04437}18/06/22 01:36:29 ERROR Driver1: [] [main] Exception occurred: 
> org.apache.spark.sql.AnalysisException: Cannot up cast `column_name` from 
> bigint to column_name#2525: smallint as it may truncate.{color}
> that specific column is having Bigint datatype, But there were other table's 
> that ran fine with Bigint columns.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24631) Cannot up cast column from bigint to smallint as it may truncate

2018-06-22 Thread Sivakumar (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24631?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sivakumar updated SPARK-24631:
--
Description: 
Getting the below error when executing a simple select query.

Sample:

Table Description:

name: String, id: BigInt

val df=spark.sql("select name,id from testtable")

ERROR: {color:#ff}Cannot up cast column "id" from bigint to smallint as it 
may truncate.{color}

I am not doing any transformations; I am just trying to query a table, but I 
am still getting the error.

I am getting this error only on the production cluster and only for a single 
table; other tables are running fine.

+ more data:

val df=spark.sql("select * from table_name")

I am just trying to query a table, but with other tables it runs fine.

{color:#d04437}18/06/22 01:36:29 ERROR Driver1: [] [main] Exception occurred: 
org.apache.spark.sql.AnalysisException: Cannot up cast `column_name` from 
bigint to column_name#2525: smallint as it may truncate.{color}

That specific column has the Bigint datatype, but there were other tables with 
Bigint columns that ran fine.

 

  was:
Getting the below error when executing the simple select query,

Sample:

Table Description:

name: String, id: BigInt

val df=spark.sql("select name,id from testtable")

ERROR: {color:#FF}Cannot up cast column "id" from bigint to smallint as it 
may truncate.{color}

I am not doing any transformation's, I am just trying to query a table ,But 
still I am getting the error.

I am getting this error only on production cluster and only for a single table, 
other tables are running fine.

 


> Cannot up cast column from bigint to smallint as it may truncate
> 
>
> Key: SPARK-24631
> URL: https://issues.apache.org/jira/browse/SPARK-24631
> Project: Spark
>  Issue Type: New JIRA Project
>  Components: Spark Core, Spark Submit
>Affects Versions: 2.2.1
>Reporter: Sivakumar
>Priority: Major
>
> Getting the below error when executing the simple select query,
> Sample:
> Table Description:
> name: String, id: BigInt
> val df=spark.sql("select name,id from testtable")
> ERROR: {color:#ff}Cannot up cast column "id" from bigint to smallint as 
> it may truncate.{color}
> I am not doing any transformation's, I am just trying to query a table ,But 
> still I am getting the error.
> I am getting this error only on production cluster and only for a single 
> table, other tables are running fine.
> + more data,
> val df=spark.sql("select* from table_name")
> I am just trying this query a table. But with other tables it is running fine.
> {color:#d04437}18/06/22 01:36:29 ERROR Driver1: [] [main] Exception occurred: 
> org.apache.spark.sql.AnalysisException: Cannot up cast `column_name` from 
> bigint to column_name#2525: smallint as it may truncate.{color}
> that specific column is having Bigint datatype, But there were other table's 
> that ran fine with Bigint columns.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-24631) Cannot up cast column from bigint to smallint as it may truncate

2018-06-22 Thread vaquar khan (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24631?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16520533#comment-16520533
 ] 

vaquar khan edited comment on SPARK-24631 at 6/22/18 3:57 PM:
--

Can you add the complete error logs and, if possible, a small code example 
with test data (data you can find in your prod table where it's failing)?


was (Author: vaquar.k...@gmail.com):
Can you add complete error logs and if possible smalll code example with test 
data ( data you can find in your prod table where it's failing 

> Cannot up cast column from bigint to smallint as it may truncate
> 
>
> Key: SPARK-24631
> URL: https://issues.apache.org/jira/browse/SPARK-24631
> Project: Spark
>  Issue Type: New JIRA Project
>  Components: Spark Core, Spark Submit
>Affects Versions: 2.2.1
>Reporter: Sivakumar
>Priority: Major
>
> Getting the below error when executing the simple select query,
> Sample:
> Table Description:
> name: String, id: BigInt
> val df=spark.sql("select name,id from testtable")
> ERROR: {color:#FF}Cannot up cast column "id" from bigint to smallint as 
> it may truncate.{color}
> I am not doing any transformation's, I am just trying to query a table ,But 
> still I am getting the error.
> I am getting this error only on production cluster and only for a single 
> table, other tables are running fine.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24631) Cannot up cast column from bigint to smallint as it may truncate

2018-06-22 Thread vaquar khan (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24631?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16520533#comment-16520533
 ] 

vaquar khan commented on SPARK-24631:
-

Can you add the complete error logs and, if possible, a small code example 
with test data (data you can find in your prod table where it's failing)?

> Cannot up cast column from bigint to smallint as it may truncate
> 
>
> Key: SPARK-24631
> URL: https://issues.apache.org/jira/browse/SPARK-24631
> Project: Spark
>  Issue Type: New JIRA Project
>  Components: Spark Core, Spark Submit
>Affects Versions: 2.2.1
>Reporter: Sivakumar
>Priority: Major
>
> Getting the below error when executing the simple select query,
> Sample:
> Table Description:
> name: String, id: BigInt
> val df=spark.sql("select name,id from testtable")
> ERROR: {color:#FF}Cannot up cast column "id" from bigint to smallint as 
> it may truncate.{color}
> I am not doing any transformation's, I am just trying to query a table ,But 
> still I am getting the error.
> I am getting this error only on production cluster and only for a single 
> table, other tables are running fine.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23710) Upgrade Hive to 2.3.2

2018-06-22 Thread Marcelo Vanzin (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16520522#comment-16520522
 ] 

Marcelo Vanzin commented on SPARK-23710:


There are a few places in Spark that are affected by a Hive upgrade:
- Hive serde support
- Hive UD(*)F support
- The thrift server

The first two exist to support Hive's API in Spark so people can keep using 
their serdes and UDFs. The risk here is that we're crossing a Hive major 
version boundary: things in the API may have been broken, and that would 
transitively affect Spark's API.

In the real world that's already sort of a risk, though, because people might 
be running Hive 2 and thus have Hive 2 serdes in their tables; Spark trying 
to read or write data to such a table with an old version of the same serde 
could cause issues.

I think switching to the Hive mainline is a good medium- or long-term goal, 
but that would probably require a major Spark version to be more palatable, 
and perhaps should be coupled with deprecation of some features so that we can 
isolate ourselves from Hive more. It's a bit risky in a minor version.

In the short term my preference would be to either fix the fork, or go with 
Saisai's patch in HIVE-16391, which requires collaboration from the Hive side...


> Upgrade Hive to 2.3.2
> -
>
> Key: SPARK-23710
> URL: https://issues.apache.org/jira/browse/SPARK-23710
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Yuming Wang
>Priority: Critical
>
> h1. Mainly changes
>  * Maven dependency:
>  hive.version from {{1.2.1.spark2}} to {{2.3.2}} and change 
> {{hive.classifier}} to {{core}}
>  calcite.version from {{1.2.0-incubating}} to {{1.10.0}}
>  datanucleus-core.version from {{3.2.10}} to {{4.1.17}}
>  remove {{orc.classifier}}, it means orc use the {{hive.storage.api}}, see: 
> ORC-174
>  add new dependency {{avatica}} and {{hive.storage.api}}
>  * ORC compatibility changes:
>  OrcColumnVector.java, OrcColumnarBatchReader.java, OrcDeserializer.scala, 
> OrcFilters.scala, OrcSerializer.scala, OrcFilterSuite.scala
>  * hive-thriftserver java file update:
>  update {{sql/hive-thriftserver/if/TCLIService.thrift}} to hive 2.3.2
>  update {{sql/hive-thriftserver/src/main/java/org/apache/hive/service/*}} to 
> hive 2.3.2
>  * TestSuite should update:
> ||TestSuite||Reason||
> |StatisticsSuite|HIVE-16098|
> |SessionCatalogSuite|Similar to [VersionsSuite.scala#L427|#L427]|
> |CliSuite, HiveThriftServer2Suites, HiveSparkSubmitSuite, HiveQuerySuite, 
> SQLQuerySuite|Update hive-hcatalog-core-0.13.1.jar to 
> hive-hcatalog-core-2.3.2.jar|
> |SparkExecuteStatementOperationSuite|Interface changed from 
> org.apache.hive.service.cli.Type.NULL_TYPE to 
> org.apache.hadoop.hive.serde2.thrift.Type.NULL_TYPE|
> |ClasspathDependenciesSuite|org.apache.hive.com.esotericsoftware.kryo.Kryo 
> change to com.esotericsoftware.kryo.Kryo|
> |HiveMetastoreCatalogSuite|Result format changed from Seq("1.1\t1", "2.1\t2") 
> to Seq("1.100\t1", "2.100\t2")|
> |HiveOrcFilterSuite|Result format changed|
> |HiveDDLSuite|Remove $ (This change needs to be reconsidered)|
> |HiveExternalCatalogVersionsSuite| java.lang.ClassCastException: 
> org.datanucleus.identity.DatastoreIdImpl cannot be cast to 
> org.datanucleus.identity.OID|
>  * Other changes:
> Close hive schema verification:  
> [HiveClientImpl.scala#L251|https://github.com/wangyum/spark/blob/75e4cc9e80f85517889e87a35da117bc361f2ff3/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala#L251]
>  and 
> [HiveExternalCatalog.scala#L58|https://github.com/wangyum/spark/blob/75e4cc9e80f85517889e87a35da117bc361f2ff3/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala#L58]
> Update 
> [IsolatedClientLoader.scala#L189-L192|https://github.com/wangyum/spark/blob/75e4cc9e80f85517889e87a35da117bc361f2ff3/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/IsolatedClientLoader.scala#L189-L192]
> Because Hive 2.3.2's {{org.apache.hadoop.hive.ql.metadata.Hive}} can't 
> connect to Hive 1.x metastore, We should use 
> {{HiveMetaStoreClient.getDelegationToken}} instead of 
> {{Hive.getDelegationToken}} and update {{HiveClientImpl.toHiveTable}}
> All changes can be found at 
> [PR-20659|https://github.com/apache/spark/pull/20659].



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24631) Cannot up cast column from bigint to smallint as it may truncate

2018-06-22 Thread Sivakumar (JIRA)
Sivakumar created SPARK-24631:
-

 Summary: Cannot up cast column from bigint to smallint as it may 
truncate
 Key: SPARK-24631
 URL: https://issues.apache.org/jira/browse/SPARK-24631
 Project: Spark
  Issue Type: New JIRA Project
  Components: Spark Core, Spark Submit
Affects Versions: 2.2.1
Reporter: Sivakumar


Getting the below error when executing a simple select query.

Sample:

Table Description:

name: String, id: BigInt

val df=spark.sql("select name,id from testtable")

ERROR: {color:#FF}Cannot up cast column "id" from bigint to smallint as it 
may truncate.{color}

I am not doing any transformations; I am just trying to query a table, but I 
am still getting the error.

I am getting this error only on the production cluster and only for a single 
table; other tables are running fine.
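
A small diagnostic sketch that may help narrow this down (run in pyspark with 
an active {{spark}} session; whether the mismatch really sits between the 
catalog definition and the resolved schema is an assumption, not confirmed 
here):

{code}
# Compare the catalog definition with the types in the error message; a
# bigint-vs-smallint mismatch here may point at a stale view or metastore
# definition rather than at the query itself. (Diagnostic sketch only.)
spark.sql("DESCRIBE FORMATTED testtable").show(100, truncate=False)
spark.sql("SHOW CREATE TABLE testtable").show(truncate=False)
{code}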

 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-24130) Data Source V2: Join Push Down

2018-06-22 Thread vaquar khan (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16520461#comment-16520461
 ] 

vaquar khan edited comment on SPARK-24130 at 6/22/18 3:02 PM:
--

Could you please attach the doc to this JIRA instead of a Google doc?


was (Author: vaquar.k...@gmail.com):
Could you please update doc in Jira insted of google doc 

> Data Source V2: Join Push Down
> --
>
> Key: SPARK-24130
> URL: https://issues.apache.org/jira/browse/SPARK-24130
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Jia Li
>Priority: Major
>
> Spark applications often directly query external data sources such as 
> relational databases, or files. Spark provides Data Sources APIs for 
> accessing structured data through Spark SQL. Data Sources APIs in both V1 and 
> V2 support optimizations such as Filter push down and Column pruning which 
> are subset of the functionality that can be pushed down to some data sources. 
> We’re proposing to extend Data Sources APIs with join push down (JPD). Join 
> push down significantly improves query performance by reducing the amount of 
> data transfer and exploiting the capabilities of the data sources such as 
> index access.
> Join push down design document is available 
> [here|https://docs.google.com/document/d/1k-kRadTcUbxVfUQwqBbIXs_yPZMxh18-e-cz77O_TaE/edit?usp=sharing].



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24130) Data Source V2: Join Push Down

2018-06-22 Thread vaquar khan (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16520461#comment-16520461
 ] 

vaquar khan commented on SPARK-24130:
-

Could you please update the doc in this JIRA instead of a Google doc?

> Data Source V2: Join Push Down
> --
>
> Key: SPARK-24130
> URL: https://issues.apache.org/jira/browse/SPARK-24130
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Jia Li
>Priority: Major
>
> Spark applications often directly query external data sources such as 
> relational databases, or files. Spark provides Data Sources APIs for 
> accessing structured data through Spark SQL. Data Sources APIs in both V1 and 
> V2 support optimizations such as Filter push down and Column pruning which 
> are subset of the functionality that can be pushed down to some data sources. 
> We’re proposing to extend Data Sources APIs with join push down (JPD). Join 
> push down significantly improves query performance by reducing the amount of 
> data transfer and exploiting the capabilities of the data sources such as 
> index access.
> Join push down design document is available 
> [here|https://docs.google.com/document/d/1k-kRadTcUbxVfUQwqBbIXs_yPZMxh18-e-cz77O_TaE/edit?usp=sharing].



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24519) MapStatus has 2000 hardcoded

2018-06-22 Thread Thomas Graves (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24519?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves resolved SPARK-24519.
---
   Resolution: Fixed
 Assignee: Hieu Tri Huynh
Fix Version/s: 2.4.0

> MapStatus has 2000 hardcoded
> 
>
> Key: SPARK-24519
> URL: https://issues.apache.org/jira/browse/SPARK-24519
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: Hieu Tri Huynh
>Assignee: Hieu Tri Huynh
>Priority: Minor
> Fix For: 2.4.0
>
>
> MapStatus uses a hardcoded value of 2000 partitions to determine if it should 
> use the highly compressed map status. We should make it configurable to allow 
> users to more easily tune their jobs with respect to this without having to 
> modify their code to change the number of partitions.  Note we can leave this 
> as an internal/undocumented config for now until we have more advice for the 
> users on how to set this config.
> Some of my reasoning:
> The config gives you a way to easily change something without the user having 
> to change code, redeploy the jar, and then run again. You can simply change 
> the config and rerun. It also allows for easier experimentation. Changing the 
> # of partitions has other side effects, whether good or bad is situation 
> dependent. It can be worse, as you could be increasing the # of output files 
> when you don't want to, and it affects the # of tasks needed and thus the 
> executors to run in parallel, etc.
> There have been various talks about this number at Spark summits where people 
> have told customers to increase it to 2001 partitions. Note if you just do a 
> search for "spark 2000 partitions" you will find various things all talking 
> about this number.  This shows that people are modifying their code to take 
> this into account, so it seems to me having this configurable would be better.
> Once we have more advice for users we could expose this and document 
> information on it.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24130) Data Source V2: Join Push Down

2018-06-22 Thread Parshuram V Patki (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16520386#comment-16520386
 ] 

Parshuram V Patki commented on SPARK-24130:
---

[~jliwork] do you think this improvement will make it into Spark 2.4.0?

> Data Source V2: Join Push Down
> --
>
> Key: SPARK-24130
> URL: https://issues.apache.org/jira/browse/SPARK-24130
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Jia Li
>Priority: Major
>
> Spark applications often directly query external data sources such as 
> relational databases or files. Spark provides Data Sources APIs for 
> accessing structured data through Spark SQL. Data Sources APIs in both V1 and 
> V2 support optimizations such as filter push down and column pruning, which 
> are a subset of the functionality that can be pushed down to some data sources. 
> We’re proposing to extend Data Sources APIs with join push down (JPD). Join 
> push down significantly improves query performance by reducing the amount of 
> data transfer and exploiting the capabilities of the data sources such as 
> index access.
> Join push down design document is available 
> [here|https://docs.google.com/document/d/1k-kRadTcUbxVfUQwqBbIXs_yPZMxh18-e-cz77O_TaE/edit?usp=sharing].
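To make the intent concrete, here is a small sketch (not part of the proposal itself) of 
the kind of query that would benefit, assuming a spark-shell session: two JDBC-backed 
tables joined in Spark, where a join-push-down-capable source could instead run the join 
as a single remote query. URLs, table names, and column names are placeholders.

{code:scala}
// Sketch only: connection details are placeholders, and today this join executes in Spark.
// With join push down, a capable source could run it remotely as one query.
val orders = spark.read.format("jdbc")
  .option("url", "jdbc:postgresql://db-host:5432/sales")
  .option("dbtable", "orders")
  .load()

val customers = spark.read.format("jdbc")
  .option("url", "jdbc:postgresql://db-host:5432/sales")
  .option("dbtable", "customers")
  .load()

// Conceptually: SELECT ... FROM orders o JOIN customers c ON o.customer_id = c.id
val joined = orders.join(customers, orders("customer_id") === customers("id"))
joined.explain() // today the plan shows two separate scans plus a Spark-side join
{code}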



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22897) Expose stageAttemptId in TaskContext

2018-06-22 Thread Thomas Graves (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-22897?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves updated SPARK-22897:
--
Fix Version/s: 2.2.2

> Expose  stageAttemptId in TaskContext
> -
>
> Key: SPARK-22897
> URL: https://issues.apache.org/jira/browse/SPARK-22897
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.2, 2.2.1
>Reporter: Xianjin YE
>Assignee: Xianjin YE
>Priority: Minor
> Fix For: 2.2.2, 2.3.0
>
>
> Currently, there's no easy way for an Executor to detect that a new stage 
> attempt has been launched, as stageAttemptId is missing. 
> I'd like to propose exposing stageAttemptId in TaskContext, and will send a 
> PR if the community thinks it's a good idea.
> cc [~cloud_fan]
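A rough sketch of the intended usage, assuming a spark-shell session (sc in scope) and 
assuming the accessor ends up exposed roughly as stageAttemptNumber(); the exact API name 
is an assumption made for illustration, not the confirmed final signature.

{code:scala}
// Sketch only: stageAttemptNumber() is assumed here as the exposed accessor name.
import org.apache.spark.TaskContext

val rdd = sc.parallelize(1 to 100, 4)

val tagged = rdd.mapPartitions { iter =>
  val ctx = TaskContext.get()
  // A non-zero attempt number would tell the task that its stage was re-launched,
  // e.g. to skip non-idempotent side effects on retries.
  val attempt = ctx.stageAttemptNumber()
  iter.map(x => (attempt, x))
}

tagged.take(3).foreach(println)
{code}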



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24630) SPIP: Support SQLStreaming in Spark

2018-06-22 Thread Jackey Lee (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jackey Lee updated SPARK-24630:
---
Attachment: SQLStreaming SPIP.pdf

> SPIP: Support SQLStreaming in Spark
> ---
>
> Key: SPARK-24630
> URL: https://issues.apache.org/jira/browse/SPARK-24630
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0, 2.2.1
>Reporter: Jackey Lee
>Priority: Minor
>  Labels: SQLStreaming
> Attachments: SQLStreaming SPIP.pdf
>
>
> At present, KafkaSQL, Flink SQL (which is actually based on Calcite), 
> SQLStream, and StormSQL all provide a stream-type SQL interface, with which users 
> with little knowledge about streaming can easily develop a stream 
> processing model. In Spark, we can also support a SQL API based on 
> Structured Streaming.
> To support SQL Streaming, there are two key points: 
> 1. Analysis should be able to parse streaming-type SQL. 
> 2. The Analyzer should be able to map metadata information to the corresponding 
> Relation. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24630) SPIP: Support SQLStreaming in Spark

2018-06-22 Thread Jackey Lee (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jackey Lee updated SPARK-24630:
---
Summary: SPIP: Support SQLStreaming in Spark  (was: Support SQLStreaming in 
Spark)

> SPIP: Support SQLStreaming in Spark
> ---
>
> Key: SPARK-24630
> URL: https://issues.apache.org/jira/browse/SPARK-24630
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0, 2.2.1
>Reporter: Jackey Lee
>Priority: Minor
>  Labels: SQLStreaming
>
> At present, KafkaSQL, Flink SQL (which is actually based on Calcite), 
> SQLStream, and StormSQL all provide a stream-type SQL interface, with which users 
> with little knowledge about streaming can easily develop a stream 
> processing model. In Spark, we can also support a SQL API based on 
> Structured Streaming.
> To support SQL Streaming, there are two key points: 
> 1. Analysis should be able to parse streaming-type SQL. 
> 2. The Analyzer should be able to map metadata information to the corresponding 
> Relation. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24630) Support SQLStreaming in Spark

2018-06-22 Thread Jackey Lee (JIRA)
Jackey Lee created SPARK-24630:
--

 Summary: Support SQLStreaming in Spark
 Key: SPARK-24630
 URL: https://issues.apache.org/jira/browse/SPARK-24630
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.2.1, 2.2.0
Reporter: Jackey Lee


At present, KafkaSQL, Flink SQL (which is actually based on Calcite), SQLStream, 
and StormSQL all provide a stream-type SQL interface, with which users with little 
knowledge about streaming can easily develop a stream processing model. 
In Spark, we can also support a SQL API based on Structured Streaming.

To support SQL Streaming, there are two key points: 
1. Analysis should be able to parse streaming-type SQL. 
2. The Analyzer should be able to map metadata information to the corresponding 
Relation. 
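For comparison, a minimal sketch of what this looks like with today's Structured 
Streaming APIs in a spark-shell session; broker and topic names are placeholders, the 
spark-sql-kafka-0-10 package is assumed on the classpath, and the SPIP would let the 
whole pipeline, including source and sink, be expressed in SQL instead.

{code:scala}
// Sketch with current APIs, for comparison with the proposed SQL-only interface.
// Broker/topic names are placeholders; requires the spark-sql-kafka-0-10 package.
val events = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker-1:9092")
  .option("subscribe", "events")
  .load()

events.selectExpr("CAST(value AS STRING) AS value").createOrReplaceTempView("events")

// Plain SQL over the streaming view; the SPIP aims to make the source/sink part SQL too.
val counts = spark.sql("SELECT value, count(*) AS cnt FROM events GROUP BY value")

counts.writeStream
  .outputMode("complete")
  .format("console")
  .start()
{code}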



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24629) thrift server memory leak when beeline connection quits

2018-06-22 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24629?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24629:


Assignee: Apache Spark

> thrift server memory leak when beeline connection quits
> ---
>
> Key: SPARK-24629
> URL: https://issues.apache.org/jira/browse/SPARK-24629
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.1, 2.3.1
>Reporter: StephenZou
>Assignee: Apache Spark
>Priority: Minor
> Attachments: .png, .png
>
>
> When a Beeline connection closes, the Spark thrift server (STS) sends a session 
> close event, and the relevant listener in class HiveThriftServer2 cleans up 
> its internal session state.
> But the internal statement state is not updated (see the graph attached), so it 
> remains in the Compiled state and is never swept out. 
> The other attached graph shows the state after fixing it.
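To illustrate the leak pattern described above, here is a simplified stand-alone model; 
it is not the actual HiveThriftServer2 listener code, just a sketch of the idea that 
statement entries are only evicted once they reach a terminal state, so anything still 
marked Compiled survives session cleanup and accumulates.

{code:scala}
// Simplified model of the leak, not the real listener implementation.
object StatementState extends Enumeration {
  val Compiled, Finished, Failed, Closed = Value
}

import scala.collection.mutable

val statements = mutable.Map[String, StatementState.Value]()

def onSessionClosed(sessionStatementIds: Seq[String]): Unit = {
  sessionStatementIds.foreach { id =>
    statements.get(id) match {
      // Terminal states are swept out with the session...
      case Some(StatementState.Finished) | Some(StatementState.Failed) |
           Some(StatementState.Closed) => statements.remove(id)
      // ...but entries stuck in Compiled are left behind and accumulate.
      case _ => ()
    }
  }
}
{code}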



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24629) thrift server memory leak when beeline connection quits

2018-06-22 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24629?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24629:


Assignee: (was: Apache Spark)

> thrift server memory leak when beeline connection quits
> ---
>
> Key: SPARK-24629
> URL: https://issues.apache.org/jira/browse/SPARK-24629
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.1, 2.3.1
>Reporter: StephenZou
>Priority: Minor
> Attachments: .png, .png
>
>
> When a Beeline connection closes, the Spark thrift server (STS) sends a session 
> close event, and the relevant listener in class HiveThriftServer2 cleans up 
> its internal session state.
> But the internal statement state is not updated (see the graph attached), so it 
> remains in the Compiled state and is never swept out. 
> The other attached graph shows the state after fixing it.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24629) thrift server memory leak when beeline connection quits

2018-06-22 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24629?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16520219#comment-16520219
 ] 

Apache Spark commented on SPARK-24629:
--

User 'ChenjunZou' has created a pull request for this issue:
https://github.com/apache/spark/pull/21613

> thrift server memory leak when beeline connection quits
> ---
>
> Key: SPARK-24629
> URL: https://issues.apache.org/jira/browse/SPARK-24629
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.1, 2.3.1
>Reporter: StephenZou
>Priority: Minor
> Attachments: .png, .png
>
>
> When a Beeline connection closes, the Spark thrift server (STS) sends a session 
> close event, and the relevant listener in class HiveThriftServer2 cleans up 
> its internal session state.
> But the internal statement state is not updated (see the graph attached), so it 
> remains in the Compiled state and is never swept out. 
> The other attached graph shows the state after fixing it.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24629) thrift server memory leak when beeline connection quits

2018-06-22 Thread StephenZou (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24629?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

StephenZou updated SPARK-24629:
---
Description: 
When a Beeline connection closes, the Spark thrift server (STS) sends a session 
close event, and the relevant listener in class HiveThriftServer2 cleans up 
its internal session state.

But the internal statement state is not updated (see the graph attached), so it 
remains in the Compiled state and is never swept out. 

The other attached graph shows the state after fixing it.

  was:
When Beeline connection closes, spark thrift server (STS) will send a session 
close event, then the relevant listener in class HiveThriftServer2 will clean 
its internal session states.

But the internal statement state is not updated, (see graph attached).  So it 
remains in state Compiled and doesn't be swept out. 


> thrift server memory leak when beeline connection quits
> ---
>
> Key: SPARK-24629
> URL: https://issues.apache.org/jira/browse/SPARK-24629
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.1, 2.3.1
>Reporter: StephenZou
>Priority: Minor
> Attachments: .png, .png
>
>
> When a Beeline connection closes, the Spark thrift server (STS) sends a session 
> close event, and the relevant listener in class HiveThriftServer2 cleans up 
> its internal session state.
> But the internal statement state is not updated (see the graph attached), so it 
> remains in the Compiled state and is never swept out. 
> The other attached graph shows the state after fixing it.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24629) thrift server memory leak when beeline connection quits

2018-06-22 Thread StephenZou (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24629?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

StephenZou updated SPARK-24629:
---
Attachment: .png

> thrift server memory leak when beeline connection quits
> ---
>
> Key: SPARK-24629
> URL: https://issues.apache.org/jira/browse/SPARK-24629
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.1, 2.3.1
>Reporter: StephenZou
>Priority: Minor
> Attachments: .png, .png
>
>
> When a Beeline connection closes, the Spark thrift server (STS) sends a session 
> close event, and the relevant listener in class HiveThriftServer2 cleans up 
> its internal session state.
> But the internal statement state is not updated (see the graph attached), so it 
> remains in the Compiled state and is never swept out. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24629) thrift server memory leak when beeline connection quits

2018-06-22 Thread StephenZou (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24629?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

StephenZou updated SPARK-24629:
---
Attachment: .png

> thrift server memory leak when beeline connection quits
> ---
>
> Key: SPARK-24629
> URL: https://issues.apache.org/jira/browse/SPARK-24629
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.1, 2.3.1
>Reporter: StephenZou
>Priority: Minor
> Attachments: .png
>
>
> When a Beeline connection closes, the Spark thrift server (STS) sends a session 
> close event, and the relevant listener in class HiveThriftServer2 cleans up 
> its internal session state.
> But the internal statement state is not updated (see the graph attached), so it 
> remains in the Compiled state and is never swept out. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24629) thrift server memory leak when beeline connection quits

2018-06-22 Thread StephenZou (JIRA)
StephenZou created SPARK-24629:
--

 Summary: thrift server memory leak when beeline connection quits
 Key: SPARK-24629
 URL: https://issues.apache.org/jira/browse/SPARK-24629
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.3.1, 2.2.1
Reporter: StephenZou


When a Beeline connection closes, the Spark thrift server (STS) sends a session 
close event, and the relevant listener in class HiveThriftServer2 cleans up 
its internal session state.

But the internal statement state is not updated (see the graph attached), so it 
remains in the Compiled state and is never swept out. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24458) Invalid PythonUDF check_1(), requires attributes from more than one child

2018-06-22 Thread Ruben Berenguel (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16520165#comment-16520165
 ] 

Ruben Berenguel commented on SPARK-24458:
-

[~hyukjin.kwon] I just built 2.3.0 from the tagged branch and it's still 
passing, but the branch points to 2.3.2. I tried building 2.2 from the branch 
2.2, but the build is failing for me locally. What is the best way to roll back 
locally to older versions of Spark?

> Invalid PythonUDF check_1(), requires attributes from more than one child
> -
>
> Key: SPARK-24458
> URL: https://issues.apache.org/jira/browse/SPARK-24458
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.0
> Environment: Spark 2.3.0 (local mode)
> Mac OSX
>Reporter: Abdeali Kothari
>Priority: Major
>
> I was trying out a very large query execution plan I have and I got the error:
>  
> {code:java}
> py4j.protocol.Py4JJavaError: An error occurred while calling 
> o359.simpleString.
> : java.lang.RuntimeException: Invalid PythonUDF check_1(), requires 
> attributes from more than one child.
>  at scala.sys.package$.error(package.scala:27)
>  at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$org$apache$spark$sql$execution$python$ExtractPythonUDFs$$extract$2.apply(ExtractPythonUDFs.scala:182)
>  at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$org$apache$spark$sql$execution$python$ExtractPythonUDFs$$extract$2.apply(ExtractPythonUDFs.scala:181)
>  at scala.collection.immutable.Stream.foreach(Stream.scala:594)
>  at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$.org$apache$spark$sql$execution$python$ExtractPythonUDFs$$extract(ExtractPythonUDFs.scala:181)
>  at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$apply$2.applyOrElse(ExtractPythonUDFs.scala:118)
>  at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$apply$2.applyOrElse(ExtractPythonUDFs.scala:114)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:289)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:289)
>  at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:288)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:286)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:286)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:286)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:286)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:286)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:286)
>  at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$.apply(ExtractPythonUDFs.scala:114)
>  at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$.apply(ExtractPythonUDFs.scala:94)
>  at 
> org.apache.spark.sql.execution.QueryExecution$$anonfun$prepareForExecution$1.apply(QueryExecution.scala:87)
>  at 
> org.apache.spark.sql.execution.QueryExecution$$anonfun$prepareForExecution$1.apply(QueryExecution.scala:87)
>  at 
> scala.collection.LinearSeqOptimized$class.foldLeft(LinearSeqOptimized.scala:124)
>  at scala.collection.immutable.List.foldLeft(List.scala:84)
>  at 
> org.apache.spark.sql.execution.QueryExecution.prepareForExecution(QueryExecution.scala:87)
>  at 
> org.apache.spark.sql.execution.QueryExecution.executedPlan$lzycompute(QueryExecution.scala:77)
>  at 
> org.apache.spark.sql.execution.QueryExecution.executedPlan(QueryExecution.scala:77)
>  at 
> org.apache.spark.sql.execution.QueryExecution$$anonfun$simpleString$1.apply(QueryExecution.scala:187)
>  at 
> org.apache.spark.sql.execution.QueryExecution$$anonfun$simpleString$1.apply(QueryExecution.scala:187)
>  at 
> org.apache.spark.sql.execution.QueryExecution.stringOrError(QueryExecution.scala:100)
>  at 
> org.apache.spark.sql.execution.QueryExecution.simpleString(QueryExecution.scala:187)
>  at sun.reflect.NativeMethodAccessorIm

[jira] [Commented] (SPARK-24498) Add JDK compiler for runtime codegen

2018-06-22 Thread Takeshi Yamamuro (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16520161#comment-16520161
 ] 

Takeshi Yamamuro commented on SPARK-24498:
--

Yeah, that might be true for now. But I think we need to think more about compiler 
options and other conditions for getting any benefit. Any comments and 
suggestions are welcome! Thanks for your comment!

> Add JDK compiler for runtime codegen
> 
>
> Key: SPARK-24498
> URL: https://issues.apache.org/jira/browse/SPARK-24498
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Priority: Major
>
> In some cases, the JDK compiler can generate smaller bytecode and take less time 
> in compilation compared to Janino. However, in some cases, Janino is better. 
> We should support both for our runtime codegen. Janino will still be our 
> default runtime codegen compiler. 
> See the related JIRAs in DRILL: 
> - https://issues.apache.org/jira/browse/DRILL-1155
> - https://issues.apache.org/jira/browse/DRILL-4778
> - https://issues.apache.org/jira/browse/DRILL-5696
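As a minimal, self-contained illustration of the alternative backend being discussed 
(the JDK compiler exposed via javax.tools), the sketch below compiles a tiny generated 
class to a temp directory. It is not Spark's codegen pipeline, just the bare compiler 
invocation; the class body is a made-up stand-in for generated code.

{code:scala}
// Sketch: invoking the JDK compiler (javax.tools) on a tiny generated class.
import javax.tools.ToolProvider
import java.nio.file.Files

val source =
  "public class GeneratedExpr { public static int eval(int x) { return x * 2 + 1; } }"

val outDir = Files.createTempDirectory("codegen")
val srcFile = outDir.resolve("GeneratedExpr.java")
Files.write(srcFile, source.getBytes("UTF-8"))

// Returns null when running on a JRE that ships without the compiler.
val compiler = ToolProvider.getSystemJavaCompiler
val exitCode = compiler.run(null, null, null, "-d", outDir.toString, srcFile.toString)
println(s"javac exit code: $exitCode") // 0 means GeneratedExpr.class was written to outDir
{code}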



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20295) when spark.sql.adaptive.enabled is enabled, have conflict with Exchange Reuse

2018-06-22 Thread Yuming Wang (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-20295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16520148#comment-16520148
 ] 

Yuming Wang commented on SPARK-20295:
-

[~KevinZwx] Can you try [https://github.com/Intel-bigdata/spark-adaptive]?

 

> when spark.sql.adaptive.enabled is enabled, have conflict with Exchange Reuse
> --
>
> Key: SPARK-20295
> URL: https://issues.apache.org/jira/browse/SPARK-20295
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle, SQL
>Affects Versions: 2.1.0
>Reporter: Ruhui Wang
>Priority: Major
>
> When running tpcds-q95 with spark.sql.adaptive.enabled = true, the physical 
> plan is initially:
> Sort
> :  +- Exchange(coordinator id: 1)
> : +- Project***
> ::-Sort **
> ::  +- Exchange(coordinator id: 2)
> :: :- Project ***
> :+- Sort
> ::  +- Exchange(coordinator id: 3)
> When spark.sql.exchange.reuse is enabled, the physical plan becomes:
> Sort
> :  +- Exchange(coordinator id: 1)
> : +- Project***
> ::-Sort **
> ::  +- Exchange(coordinator id: 2)
> :: :- Project ***
> :+- Sort
> ::  +- ReusedExchange  Exchange(coordinator id: 2)
> If spark.sql.adaptive.enabled = true, the call stack is: 
> ShuffleExchange#doExecute --> postShuffleRDD --> doEstimationIfNecessary. 
> In this function, assert(exchanges.length == numExchanges) fails, because the 
> left side has only one element while the right side equals 2.
> Is this a bug in the interaction of spark.sql.adaptive.enabled and exchange reuse?
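A short reproduction sketch for a spark-shell session, setting both flags named in the 
report explicitly; the query text itself is a placeholder, since TPC-DS q95 is not 
reproduced here.

{code:scala}
// Reproduction sketch: the query string is a placeholder for TPC-DS q95.
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.exchange.reuse", "true") // already true by default

// With both enabled, a ReusedExchange collapses two coordinated exchanges into one,
// which is what the assert(exchanges.length == numExchanges) check then trips over.
val tpcdsQ95Sql: String = sys.env.getOrElse("Q95_SQL", "SELECT 1") // placeholder
spark.sql(tpcdsQ95Sql).collect()
{code}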



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24628) The example given to create a dense matrix using python has a mistake

2018-06-22 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24628?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16520141#comment-16520141
 ] 

Apache Spark commented on SPARK-24628:
--

User 'huangweizhe123' has created a pull request for this issue:
https://github.com/apache/spark/pull/21612

> The example given to create a dense matrix using python has a mistake
> -
>
> Key: SPARK-24628
> URL: https://issues.apache.org/jira/browse/SPARK-24628
> Project: Spark
>  Issue Type: Documentation
>  Components: MLlib
>Affects Versions: 2.3.1
>Reporter: Weizhe Huang
>Priority: Minor
>   Original Estimate: 10m
>  Remaining Estimate: 10m
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24628) The example given to create a dense matrix using python has a mistake

2018-06-22 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24628?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24628:


Assignee: (was: Apache Spark)

> The example given to create a dense matrix using python has a mistake
> -
>
> Key: SPARK-24628
> URL: https://issues.apache.org/jira/browse/SPARK-24628
> Project: Spark
>  Issue Type: Documentation
>  Components: MLlib
>Affects Versions: 2.3.1
>Reporter: Weizhe Huang
>Priority: Minor
>   Original Estimate: 10m
>  Remaining Estimate: 10m
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24628) The example given to create a dense matrix using python has a mistake

2018-06-22 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24628?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24628:


Assignee: Apache Spark

> The example given to create a dense matrix using python has a mistake
> -
>
> Key: SPARK-24628
> URL: https://issues.apache.org/jira/browse/SPARK-24628
> Project: Spark
>  Issue Type: Documentation
>  Components: MLlib
>Affects Versions: 2.3.1
>Reporter: Weizhe Huang
>Assignee: Apache Spark
>Priority: Minor
>   Original Estimate: 10m
>  Remaining Estimate: 10m
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24628) The example given to create a dense matrix using python has a mistake

2018-06-22 Thread Weizhe Huang (JIRA)
Weizhe Huang created SPARK-24628:


 Summary: The example given to create a dense matrix using python 
has a mistake
 Key: SPARK-24628
 URL: https://issues.apache.org/jira/browse/SPARK-24628
 Project: Spark
  Issue Type: Documentation
  Components: MLlib
Affects Versions: 2.3.1
Reporter: Weizhe Huang






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24498) Add JDK compiler for runtime codegen

2018-06-22 Thread Marco Gaido (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16520125#comment-16520125
 ] 

Marco Gaido commented on SPARK-24498:
-

Thanks for your great analysis [~maropu]! Very interesting. Seems like there is 
no advantage in introducing a new compiler.

> Add JDK compiler for runtime codegen
> 
>
> Key: SPARK-24498
> URL: https://issues.apache.org/jira/browse/SPARK-24498
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Priority: Major
>
> In some cases, the JDK compiler can generate smaller bytecode and take less time 
> in compilation compared to Janino. However, in some cases, Janino is better. 
> We should support both for our runtime codegen. Janino will still be our 
> default runtime codegen compiler. 
> See the related JIRAs in DRILL: 
> - https://issues.apache.org/jira/browse/DRILL-1155
> - https://issues.apache.org/jira/browse/DRILL-4778
> - https://issues.apache.org/jira/browse/DRILL-5696



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23603) When the length of the json is in a range,get_json_object will result in missing tail data

2018-06-22 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-23603?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-23603.
--
Resolution: Duplicate

2.7.x has a regression, so we had to revert it. See also 
https://github.com/apache/spark/pull/9759

> When the length of the json is in a range,get_json_object will result in 
> missing tail data
> --
>
> Key: SPARK-23603
> URL: https://issues.apache.org/jira/browse/SPARK-23603
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.2.0, 2.3.0
>Reporter: dzcxzl
>Priority: Major
>
> Jackson (>= 2.7.7) fixes the possibility of missing tail data when the length 
> of the value falls in a certain range:
> [https://github.com/FasterXML/jackson/wiki/Jackson-Release-2.7.7]
> [https://github.com/FasterXML/jackson-core/issues/307]
> spark-shell:
> {code:java}
> val value = "x" * 3000
> val json = s"""{"big": "$value"}"""
> spark.sql("select length(get_json_object(\'"+json+"\','$.big'))" ).collect
> res0: Array[org.apache.spark.sql.Row] = Array([2991])
> {code}
> expected result: 3000 
> actual result: 2991
> There are two possible solutions.
> One is to *bump Jackson from 2.6.7 & 2.6.7.1 to 2.7.7*.
> The other is to *replace writeRaw(char[] text, int offset, int len) with 
> writeRaw(String text)*.
>  
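A small standalone sketch of the second proposed fix, using Jackson's JsonGenerator 
directly (outside Spark) to show why the String overload of writeRaw sidesteps the 
char[]-slice truncation fixed upstream in 2.7.7; the 3000-character value mirrors the 
reproduction above.

{code:scala}
// Standalone Jackson sketch (not Spark code): prefer writeRaw(String) over the
// writeRaw(char[], offset, len) overload affected by the pre-2.7.7 bug.
import com.fasterxml.jackson.core.JsonFactory
import java.io.StringWriter

val writer = new StringWriter()
val gen = new JsonFactory().createGenerator(writer)

val raw = "\"" + ("x" * 3000) + "\"" // a long raw JSON string value

// Instead of: gen.writeRaw(raw.toCharArray, 0, raw.length)
gen.writeRaw(raw) // the String overload avoids the buffer-boundary truncation
gen.flush()

println(writer.toString.length) // expected 3002: 3000 chars plus the two quotes
{code}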



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24498) Add JDK compiler for runtime codegen

2018-06-22 Thread Takeshi Yamamuro (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16520090#comment-16520090
 ] 

Takeshi Yamamuro commented on SPARK-24498:
--

If there are javac options related to performance, bytecode size, compile time, 
..., please let me know. I'm currently looking into them now.

> Add JDK compiler for runtime codegen
> 
>
> Key: SPARK-24498
> URL: https://issues.apache.org/jira/browse/SPARK-24498
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Priority: Major
>
> In some cases, the JDK compiler can generate smaller bytecode and take less time 
> in compilation compared to Janino. However, in some cases, Janino is better. 
> We should support both for our runtime codegen. Janino will still be our 
> default runtime codegen compiler. 
> See the related JIRAs in DRILL: 
> - https://issues.apache.org/jira/browse/DRILL-1155
> - https://issues.apache.org/jira/browse/DRILL-4778
> - https://issues.apache.org/jira/browse/DRILL-5696



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23934) High-order function: map_from_entries(array>) → map

2018-06-22 Thread Takuya Ueshin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-23934?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takuya Ueshin resolved SPARK-23934.
---
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 21282
https://github.com/apache/spark/pull/21282

> High-order function: map_from_entries(array>) → map
> --
>
> Key: SPARK-23934
> URL: https://issues.apache.org/jira/browse/SPARK-23934
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Assignee: Marek Novotny
>Priority: Major
> Fix For: 2.4.0
>
>
> Ref: https://prestodb.io/docs/current/functions/map.html
> Returns a map created from the given array of entries.
> {noformat}
> SELECT map_from_entries(ARRAY[(1, 'x'), (2, 'y')]); -- {1 -> 'x', 2 -> 'y'}
> {noformat}
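For a Spark-side counterpart, a minimal sketch in a spark-shell session on a build that 
ships the function (2.4.0 per the fix version above); the input column is an array of 
(key, value) structs.

{code:scala}
// Sketch assuming a spark-shell session where map_from_entries is available.
import spark.implicits._

val df = Seq(Seq((1, "x"), (2, "y"))).toDF("entries") // array<struct<_1:int,_2:string>>
df.selectExpr("map_from_entries(entries) AS m").show(false)
// expected: one row whose column m is the map {1 -> x, 2 -> y}
{code}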



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23934) High-order function: map_from_entries(array>) → map

2018-06-22 Thread Takuya Ueshin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-23934?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takuya Ueshin reassigned SPARK-23934:
-

Assignee: Marek Novotny

> High-order function: map_from_entries(array>) → map
> --
>
> Key: SPARK-23934
> URL: https://issues.apache.org/jira/browse/SPARK-23934
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Assignee: Marek Novotny
>Priority: Major
>
> Ref: https://prestodb.io/docs/current/functions/map.html
> Returns a map created from the given array of entries.
> {noformat}
> SELECT map_from_entries(ARRAY[(1, 'x'), (2, 'y')]); -- {1 -> 'x', 2 -> 'y'}
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24627) [Spark2.3.0] After HDFS Token expire kinit not able to submit job using beeline

2018-06-22 Thread ABHISHEK KUMAR GUPTA (JIRA)
ABHISHEK KUMAR GUPTA created SPARK-24627:


 Summary: [Spark2.3.0] After HDFS Token expire kinit not able to 
submit job using beeline
 Key: SPARK-24627
 URL: https://issues.apache.org/jira/browse/SPARK-24627
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.3.0
 Environment: OS: SUSE11

Spark Version: 2.3.0 

Hadoop: 2.8.3
Reporter: ABHISHEK KUMAR GUPTA


Steps:

A beeline session was active.
1. Launch spark-beeline 
2. create table alt_s1 (time timestamp, name string, isright boolean, datetoday 
date, num binary, height double, score float, decimaler decimal(10,0), id 
tinyint, age int, license bigint, length smallint) row format delimited fields 
terminated by ',';
3. load data local inpath '/opt/typeddata60.txt' into table alt_s1;
4. show tables; (table listed successfully)
5. select * from alt_s1;
This throws an HDFS_DELEGATION_TOKEN exception:


0: jdbc:hive2://10.18.18.214:23040/default> select * from alt_s1;
Error: org.apache.spark.SparkException: Job aborted due to stage failure: Task 
1 in stage 22.0 failed 4 times, most recent failure: Lost task 1.3 in stage 
22.0 (TID 106, blr123110, executor 1): 
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken):
 token (HDFS_DELEGATION_TOKEN token 7 for spark) can't be found in cache
at org.apache.hadoop.ipc.Client.call(Client.java:1475)
at org.apache.hadoop.ipc.Client.call(Client.java:1412)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
at com.sun.proxy.$Proxy15.getBlockLocations(Unknown Source)
at 
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getBlockLocations(ClientNamenodeProtocolTranslatorPB.java:255)
at sun.reflect.GeneratedMethodAccessor44.invoke(Unknown Source)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:191)
at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
at com.sun.proxy.$Proxy16.getBlockLocations(Unknown Source)
at 
org.apache.hadoop.hdfs.DFSClient.callGetBlockLocations(DFSClient.java:1226)
at 
org.apache.hadoop.hdfs.DFSClient.getLocatedBlocks(DFSClient.java:1213)
at 
org.apache.hadoop.hdfs.DFSClient.getLocatedBlocks(DFSClient.java:1201)
at 
org.apache.hadoop.hdfs.DFSInputStream.fetchLocatedBlocksAndGetLastBlockLength(DFSInputStream.java:306)
at 
org.apache.hadoop.hdfs.DFSInputStream.openInfo(DFSInputStream.java:272)
at org.apache.hadoop.hdfs.DFSInputStream.<init>(DFSInputStream.java:264)
at org.apache.hadoop.hdfs.DFSClient.open(DFSClient.java:1526)
at 
org.apache.hadoop.hdfs.DistributedFileSystem$3.doCall(DistributedFileSystem.java:304)
at 
org.apache.hadoop.hdfs.DistributedFileSystem$3.doCall(DistributedFileSystem.java:299)
at 
org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
at 
org.apache.hadoop.hdfs.DistributedFileSystem.open(DistributedFileSystem.java:312)
at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:769)
at 
org.apache.hadoop.mapred.LineRecordReader.<init>(LineRecordReader.java:109)
at 
org.apache.hadoop.mapred.TextInputFormat.getRecordReader(TextInputFormat.java:67)
at 
org.apache.spark.rdd.HadoopRDD$$anon$1.liftedTree1$1(HadoopRDD.scala:257)
at org.apache.spark.rdd.HadoopRDD$$anon$1.<init>(HadoopRDD.scala:256)
at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:214)
at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:94)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)

**Note: Even after kinit, the spark/hadoop token is not getting renewed.**



Now launching a spark-sql session, the same query (select * from alt_s1) is successful.
1. Launch spark-sql
2. spark-sql> select * from alt_s1;
2018-06-22 14:24:04 INFO  HiveMetaStore:746 - 0: get_table : db=test_one 
tbl=alt_s1
2018-06-22 14:24:04 INFO  audit:371 - ugi=spark/had...@hadoop.com   
ip=unknown-ip-addr  cmd=get_table : db=test_one tbl=alt_s1
2018-06-22 14:24:04 INFO  SQLStdHiveAccessController:95 - Created 
SQLStdHiveAccessController for session context : HiveAuthzSessionContext 
[sessionString=2cf6aac4-91c6-4c2d-871b-4d7620d91f43, clientType=HIVECLI]
2018-06-22 14:24:04 INFO  metastore:291 - Mestastore configuration 
hive.metastore.filter.hook changed from 
org.apache.hadoop.hive.metastore.DefaultMetaStoreFilterHookImpl to 
org.apache.hadoop.hive.ql.security.authorization.plugin.AuthorizationMetaStoreFilterHook
2018-06-22 14:24:04 INFO  HiveMetaStore:

[jira] [Comment Edited] (SPARK-24498) Add JDK compiler for runtime codegen

2018-06-22 Thread Takeshi Yamamuro (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16520056#comment-16520056
 ] 

Takeshi Yamamuro edited comment on SPARK-24498 at 6/22/18 6:59 AM:
---

Based on my rough patch 
(https://github.com/apache/spark/compare/master...maropu:JdkCompiler), the 
results of my investigation (sf=1 performance values, compile time, max method 
codegen size, and max class codegen size in TPCDS) are here: 
[https://docs.google.com/spreadsheets/d/1Mgdd9dfFaACXOUHqKfaeKrj09hB3X1j9sKTJlJ6UM6w/edit?usp=sharing]

The performance values are not very different from each other, and the JDK 
compile time is much larger than the Janino compile time in q28-q72.

It seems the max method codegen size with the JDK compiler is a little smaller than 
with Janino, though the class codegen size with the JDK compiler is larger than with Janino.


was (Author: maropu):
The results of my investigation (sf=1 performance values, compile time, max 
method codegen size, and max class codegen size in TPCDS) are here: 
[https://docs.google.com/spreadsheets/d/1Mgdd9dfFaACXOUHqKfaeKrj09hB3X1j9sKTJlJ6UM6w/edit?usp=sharing]

Their performance values are not so different between each other and Jdk 
compile time is much larger than Janino time in q28-q72.

It seems the max method codegen size in Jdk is a little smaller than that in 
Janino though, the class size in Jdk is larger than that in Janino.

> Add JDK compiler for runtime codegen
> 
>
> Key: SPARK-24498
> URL: https://issues.apache.org/jira/browse/SPARK-24498
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Priority: Major
>
> In some cases, the JDK compiler can generate smaller bytecode and take less time 
> in compilation compared to Janino. However, in some cases, Janino is better. 
> We should support both for our runtime codegen. Janino will still be our 
> default runtime codegen compiler. 
> See the related JIRAs in DRILL: 
> - https://issues.apache.org/jira/browse/DRILL-1155
> - https://issues.apache.org/jira/browse/DRILL-4778
> - https://issues.apache.org/jira/browse/DRILL-5696



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org