[jira] [Created] (SPARK-23903) Add support for date extract

2018-04-08 Thread Xiao Li (JIRA)
Xiao Li created SPARK-23903:
---

 Summary: Add support for date extract
 Key: SPARK-23903
 URL: https://issues.apache.org/jira/browse/SPARK-23903
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 2.3.0
Reporter: Xiao Li


Heavily used in time-series-based datasets.

https://www.postgresql.org/docs/9.1/static/functions-datetime.html#FUNCTIONS-DATETIME-EXTRACT

{noformat}
EXTRACT(field FROM source)
{noformat}
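
A quick sketch of the gap, assuming the PostgreSQL semantics linked above: today each field has to be pulled out with the existing per-field helpers, while the requested EXTRACT syntax would express it directly in SQL. The table and column names below are made up for illustration.

{code:java}
# Sketch only: EXTRACT(field FROM source) is not available in Spark SQL 2.3.x;
# the same values are currently obtained with the per-field date functions.
from pyspark.sql import SparkSession
from pyspark.sql.functions import to_date, year, month, dayofmonth

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("2018-04-08",)], ["d"]).select(to_date("d").alias("d"))

# Works today:
df.select(year("d"), month("d"), dayofmonth("d")).show()

# Requested PostgreSQL-style syntax (not yet parsed by Spark 2.3):
# spark.sql("SELECT EXTRACT(YEAR FROM d), EXTRACT(MONTH FROM d) FROM dates")
{code}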

 






[jira] [Updated] (SPARK-23902) Provide an option in months_between UDF to disable rounding-off

2018-04-08 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23902?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-23902:

Description: 
https://issues.apache.org/jira/browse/HIVE-15511

{noformat}
Rounding off was added in {{GenericUDFMonthsBetween}} so that it can be 
compatible with systems like oracle. However, there are places where rounding 
off is not needed.

E.g "CAST(MONTHS_BETWEEN(l_shipdate, l_commitdate) / 12 AS INT)" may not need 
rounding off via BigDecimal which is compute intensive.
{noformat}

  was:
https://issues.apache.org/jira/browse/HIVE-15511

{noformat}

Rounding off was added in {{GenericUDFMonthsBetween}} so that it can be 
compatible with systems like oracle. However, there are places where rounding 
off is not needed.

E.g "CAST(MONTHS_BETWEEN(l_shipdate, l_commitdate) / 12 AS INT)" may not need 
rounding off via BigDecimal which is compute intensive.

{noformat}


> Provide an option in months_between UDF to disable rounding-off
> ---
>
> Key: SPARK-23902
> URL: https://issues.apache.org/jira/browse/SPARK-23902
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Priority: Major
>
> https://issues.apache.org/jira/browse/HIVE-15511
> {noformat}
> Rounding off was added in {{GenericUDFMonthsBetween}} so that it can be 
> compatible with systems like oracle. However, there are places where rounding 
> off is not needed.
> E.g "CAST(MONTHS_BETWEEN(l_shipdate, l_commitdate) / 12 AS INT)" may not need 
> rounding off via BigDecimal which is compute intensive.
> {noformat}






[jira] [Created] (SPARK-23902) Provide an option in months_between UDF to disable rounding-off

2018-04-08 Thread Xiao Li (JIRA)
Xiao Li created SPARK-23902:
---

 Summary: Provide an option in months_between UDF to disable 
rounding-off
 Key: SPARK-23902
 URL: https://issues.apache.org/jira/browse/SPARK-23902
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 2.3.0
Reporter: Xiao Li


https://issues.apache.org/jira/browse/HIVE-15511

{noformat}

Rounding off was added in {{GenericUDFMonthsBetween}} so that it can be 
compatible with systems like oracle. However, there are places where rounding 
off is not needed.

E.g "CAST(MONTHS_BETWEEN(l_shipdate, l_commitdate) / 12 AS INT)" may not need 
rounding off via BigDecimal which is compute intensive.

{noformat}
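
For context, a small sketch of the behavior in question: months_between in Spark 2.3 always rounds its result (to 8 decimal places, following the Hive implementation), and the proposal is to make that rounding optional. The boolean third argument shown below is only the assumed shape of the change, mirroring the Hive patch; it does not exist in 2.3.

{code:java}
# Illustration, assuming the Hive-style rounding described above.
from pyspark.sql import SparkSession
from pyspark.sql.functions import expr

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("1997-02-28", "1996-10-30")],
                           ["l_shipdate", "l_commitdate"])

# Today: the result is always rounded.
df.select(expr("months_between(l_shipdate, l_commitdate)").alias("m")).show()

# Assumed opt-out, shaped after the Hive change (not available in 2.3):
# df.select(expr("months_between(l_shipdate, l_commitdate, false)"))
# which would let CAST(months_between(l_shipdate, l_commitdate) / 12 AS INT)
# skip the BigDecimal rounding mentioned in the description.
{code}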






[jira] [Commented] (SPARK-23901) Data Masking Functions

2018-04-08 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16430111#comment-16430111
 ] 

Xiao Li commented on SPARK-23901:
-

We can separate this to multiple JIRAs.

> Data Masking Functions
> --
>
> Key: SPARK-23901
> URL: https://issues.apache.org/jira/browse/SPARK-23901
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Priority: Major
>
> - mask()
>  - mask_first_n()
>  - mask_last_n()
>  - mask_hash()
>  - mask_show_first_n()
>  - mask_show_last_n()
> Reference:
> [1] 
> [https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-DataMaskingFunctions]
> [2] https://issues.apache.org/jira/browse/HIVE-13568
>  
>  






[jira] [Created] (SPARK-23901) Data Masking Functions

2018-04-08 Thread Xiao Li (JIRA)
Xiao Li created SPARK-23901:
---

 Summary: Data Masking Functions
 Key: SPARK-23901
 URL: https://issues.apache.org/jira/browse/SPARK-23901
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 2.3.0
Reporter: Xiao Li


- mask()
 - mask_first_n()
 - mask_last_n()
 - mask_hash()
 - mask_show_first_n()
 - mask_show_last_n()

Reference:

[1] 
[https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-DataMaskingFunctions]

[2] https://issues.apache.org/jira/browse/HIVE-13568
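
To make the intended behavior concrete, here is a rough sketch of the Hive masking convention (uppercase letters become 'X', lowercase become 'x', digits become 'n'), expressed as a plain PySpark UDF. The name mask_first_n_sketch is a hypothetical stand-in for illustration, not one of the proposed built-ins, which would be implemented natively.

{code:java}
# Rough behavioral sketch of mask_first_n, assuming the Hive default masking
# convention (upper -> 'X', lower -> 'x', digit -> 'n'); other characters pass through.
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

def _mask_char(c):
    if c.isupper():
        return "X"
    if c.islower():
        return "x"
    if c.isdigit():
        return "n"
    return c

@udf(StringType())
def mask_first_n_sketch(s, n=4):
    # Mask only the first n characters; keep the rest untouched.
    if s is None:
        return None
    return "".join(_mask_char(c) for c in s[:n]) + s[n:]

spark = SparkSession.builder.getOrCreate()
spark.createDataFrame([("AbC-1234",)], ["card"]) \
    .select(mask_first_n_sketch("card").alias("masked")) \
    .show()  # XxX-1234
{code}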

 

 






[jira] [Created] (SPARK-23900) format_number udf should take user specified format as argument

2018-04-08 Thread Xiao Li (JIRA)
Xiao Li created SPARK-23900:
---

 Summary: format_number udf should take user specified format as argument
 Key: SPARK-23900
 URL: https://issues.apache.org/jira/browse/SPARK-23900
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 2.3.0
Reporter: Xiao Li


https://issues.apache.org/jira/browse/HIVE-5370
{noformat}
Currently, format_number udf formats the number to #,###,###.##, but it should 
also take a user specified format as optional input.
{noformat}
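
For illustration, the difference between the current integer-precision argument and a user-specified pattern could look like the sketch below; the second form is the Hive behavior being requested (per HIVE-5370), not something Spark 2.3 accepts.

{code:java}
# Current vs. requested behavior, sketched with spark.sql.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Supported today: the second argument is the number of decimal places.
spark.sql("SELECT format_number(12332.123456, 4)").show()   # 12,332.1235

# Requested (Hive-style, per HIVE-5370): a DecimalFormat-like pattern string.
# spark.sql("SELECT format_number(12332.123456, '##################.###')")
{code}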






[jira] [Updated] (SPARK-23899) Built-in SQL Function Improvement

2018-04-08 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23899?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-23899:

Description: This umbrella JIRA is to improve compatibility with the other 
data processing systems, including Hive, Teradata, Presto, Postgres, MySQL, 
DB2, Oracle, and MS SQL Server.

> Built-in SQL Function Improvement
> -
>
> Key: SPARK-23899
> URL: https://issues.apache.org/jira/browse/SPARK-23899
> Project: Spark
>  Issue Type: Umbrella
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Priority: Major
> Fix For: 2.4.0
>
>
> This umbrella JIRA is to improve compatibility with the other data processing 
> systems, including Hive, Teradata, Presto, Postgres, MySQL, DB2, Oracle, and 
> MS SQL Server.






[jira] [Created] (SPARK-23899) Built-in SQL Function Improvement

2018-04-08 Thread Xiao Li (JIRA)
Xiao Li created SPARK-23899:
---

 Summary: Built-in SQL Function Improvement
 Key: SPARK-23899
 URL: https://issues.apache.org/jira/browse/SPARK-23899
 Project: Spark
  Issue Type: Umbrella
  Components: SQL
Affects Versions: 2.3.0
Reporter: Xiao Li
 Fix For: 2.4.0









[jira] [Commented] (SPARK-22342) refactor schedulerDriver registration

2018-04-08 Thread Susan X. Huynh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16429950#comment-16429950
 ] 

Susan X. Huynh commented on SPARK-22342:


Good news: I found the root cause of the multiple registration bug, and it is 
not a Spark bug. It is caused by a bug in libmesos: "using a failoverTimeout of 
0 with Mesos native scheduler client can result in infinite subscribe loop", 
https://issues.apache.org/jira/browse/MESOS-8171 . This bug leads to the 
multiple SUBSCRIBE calls seen in the driver logs. Upgrading the libmesos bundle 
in my Docker image to a version with this patch fixed the issue. cc [~skonto]

> refactor schedulerDriver registration
> -
>
> Key: SPARK-22342
> URL: https://issues.apache.org/jira/browse/SPARK-22342
> Project: Spark
>  Issue Type: Improvement
>  Components: Mesos
>Affects Versions: 2.2.0
>Reporter: Stavros Kontopoulos
>Priority: Major
>
> This is an umbrella issue for working on:
> https://github.com/apache/spark/pull/13143
> and handling the multiple re-registration issue, which invalidates an offer.
> To test:
>  dcos spark run --verbose --name=spark-nohive  --submit-args="--driver-cores 
> 1 --conf spark.cores.max=1 --driver-memory 512M --class 
> org.apache.spark.examples.SparkPi http://.../spark-examples_2.11-2.2.0.jar;
> master log:
> I1020 13:49:05.00  3087 master.cpp:6618] Updating info for framework 
> 9764beab-c90a-4b4f-b0ff-44c187851b34-0004-driver-20171020134857-0003
> I1020 13:49:05.00  3085 hierarchical.cpp:303] Added framework 
> 9764beab-c90a-4b4f-b0ff-44c187851b34-0004-driver-20171020134857-0003
> I1020 13:49:05.00  3085 hierarchical.cpp:412] Deactivated framework 
> 9764beab-c90a-4b4f-b0ff-44c187851b34-0004-driver-20171020134857-0003
> I1020 13:49:05.00  3090 hierarchical.cpp:380] Activated framework 
> 9764beab-c90a-4b4f-b0ff-44c187851b34-0004-driver-20171020134857-0003
> I1020 13:49:05.00  3087 master.cpp:2974] Subscribing framework Spark Pi 
> with checkpointing disabled and capabilities [  ]
> I1020 13:49:05.00  3087 master.cpp:6618] Updating info for framework 
> 9764beab-c90a-4b4f-b0ff-44c187851b34-0004-driver-20171020134857-0003
> I1020 13:49:05.00  3087 master.cpp:3083] Framework 
> 9764beab-c90a-4b4f-b0ff-44c187851b34-0004-driver-20171020134857-0003 (Spark 
> Pi) at scheduler-73f79027-b262-40d2-b751-05d8a6b60146@10.0.2.97:40697 failed 
> over
> I1020 13:49:05.00  3087 master.cpp:2894] Received SUBSCRIBE call for 
> framework 'Spark Pi' at 
> scheduler-73f79027-b262-40d2-b751-05d8a6b60146@10.0.2.97:40697
> I1020 13:49:05.00  3087 master.cpp:2894] Received SUBSCRIBE call for 
> framework 'Spark Pi' at 
> scheduler-73f79027-b262-40d2-b751-05d8a6b60146@10.0.2.97:40697
> I1020 13:49:05.00 3087 master.cpp:2894] Received SUBSCRIBE call for 
> framework 'Spark Pi' at 
> scheduler-73f79027-b262-40d2-b751-05d8a6b60146@10.0.2.97:40697
> I1020 13:49:05.00 3087 master.cpp:2894] Received SUBSCRIBE call for 
> framework 'Spark Pi' at 
> scheduler-73f79027-b262-40d2-b751-05d8a6b60146@10.0.2.97:40697
> I1020 13:49:05.00 3087 master.cpp:2974] Subscribing framework Spark Pi 
> with checkpointing disabled and capabilities [ ]
> I1020 13:49:05.00 3087 master.cpp:6618] Updating info for framework 
> 9764beab-c90a-4b4f-b0ff-44c187851b34-0004-driver-20171020134857-0003
> I1020 13:49:05.00 3087 master.cpp:3083] Framework 
> 9764beab-c90a-4b4f-b0ff-44c187851b34-0004-driver-20171020134857-0003 (Spark 
> Pi) at scheduler-73f79027-b262-40d2-b751-05d8a6b60146@10.0.2.97:40697 failed 
> over
> I1020 13:49:05.00 3087 master.cpp:7662] Sending 6 offers to framework 
> 9764beab-c90a-4b4f-b0ff-44c187851b34-0004-driver-20171020134857-0003 (Spark 
> Pi) at scheduler-73f79027-b262-40d2-b751-05d8a6b60146@10.0.2.97:40697
> I1020 13:49:05.00 3087 master.cpp:2974] Subscribing framework Spark Pi 
> with checkpointing disabled and capabilities [ ]
> I1020 13:49:05.00 3087 master.cpp:6618] Updating info for framework 
> 9764beab-c90a-4b4f-b0ff-44c187851b34-0004-driver-20171020134857-0003
> I1020 13:49:05.00 3087 master.cpp:3083] Framework 
> 9764beab-c90a-4b4f-b0ff-44c187851b34-0004-driver-20171020134857-0003 (Spark 
> Pi) at scheduler-73f79027-b262-40d2-b751-05d8a6b60146@10.0.2.97:40697 failed 
> over
> I1020 13:49:05.00 3087 master.cpp:9159] Removing offer 
> 9764beab-c90a-4b4f-b0ff-44c187851b34-O10039
> I1020 13:49:05.00 3087 master.cpp:9159] Removing offer 
> 9764beab-c90a-4b4f-b0ff-44c187851b34-O10038
> I1020 13:49:05.00 3087 master.cpp:9159] Removing offer 
> 9764beab-c90a-4b4f-b0ff-44c187851b34-O10037
> I1020 13:49:05.00 3087 master.cpp:9159] Removing offer 
> 9764beab-c90a-4b4f-b0ff-44c187851b34-O10036
> I1020 13:49:05.00 3087 master.cpp:9159] Removing offer 
> 

[jira] [Commented] (SPARK-23883) Error with conversion to arrow while using pandas_udf

2018-04-08 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23883?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16429941#comment-16429941
 ] 

Hyukjin Kwon commented on SPARK-23883:
--

We need to choose one side. I think documenting it should be good enough for 
now, if I understood correctly.

> Error with conversion to arrow while using pandas_udf
> -
>
> Key: SPARK-23883
> URL: https://issues.apache.org/jira/browse/SPARK-23883
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.0
> Environment: Spark 2.3.0
> Python 3.5
> Java 1.8.0_161-b12
>Reporter: Omri
>Priority: Major
>
> Hi,
> I have code that works on Databricks but doesn't work on a local Spark 
> installation.
> This is the code I'm running:
> {code:java}
> from pyspark.sql.functions import pandas_udf
> import pandas as pd
> import numpy as np
> from pyspark.sql.types import *
> schema = StructType([
>   StructField("Distance", FloatType()),
>   StructField("CarId", IntegerType())
> ])
> def haversine(lon1, lat1, lon2, lat2):
> #Calculate distance, return scalar
> return 3.5 # Removed logic to facilitate reading
> @pandas_udf(schema)
> def totalDistance(oneCar):
> dist = haversine(oneCar.Longtitude.shift(1),
>  oneCar.Latitude.shift(1),
>  oneCar.loc[1:, 'Longitude'], 
>  oneCar.loc[1:, 'Latitude'])
> return 
> pd.DataFrame({"CarId":oneCar['CarId'].iloc[0],"Distance":np.sum(dist)},index 
> = [0])
> ## Calculate the overall distance made by each car
> distancePerCar= df.groupBy('CarId').apply(totalDistance)
> {code}
> I'm getting this exception, about Arrow not able to deal with this input:
> {noformat}
> ---
> TypeError Traceback (most recent call last)
> C:\opt\spark\spark-2.3.0-bin-hadoop2.7\python\pyspark\sql\udf.py in 
> returnType(self)
> 114 try:
> --> 115 to_arrow_type(self._returnType_placeholder)
> 116 except TypeError:
> C:\opt\spark\spark-2.3.0-bin-hadoop2.7\python\pyspark\sql\types.py in 
> to_arrow_type(dt)
>1641 else:
> -> 1642 raise TypeError("Unsupported type in conversion to Arrow: " + 
> str(dt))
>1643 return arrow_type
> TypeError: Unsupported type in conversion to Arrow: 
> StructType(List(StructField(CarId,IntegerType,true),StructField(Distance,FloatType,true)))
> During handling of the above exception, another exception occurred:
> NotImplementedError   Traceback (most recent call last)
>  in ()
>  18 km = 6367 * c
>  19 return km
> ---> 20 @pandas_udf("CarId: int, Distance: float")
>  21 def totalDistance(oneUser):
>  22 dist = haversine(oneUser.Longtitude.shift(1), 
> oneUser.Latitude.shift(1),
> C:\opt\spark\spark-2.3.0-bin-hadoop2.7\python\pyspark\sql\udf.py in 
> _create_udf(f, returnType, evalType)
>  62 udf_obj = UserDefinedFunction(
>  63 f, returnType=returnType, name=None, evalType=evalType, 
> deterministic=True)
> ---> 64 return udf_obj._wrapped()
>  65 
>  66 
> C:\opt\spark\spark-2.3.0-bin-hadoop2.7\python\pyspark\sql\udf.py in 
> _wrapped(self)
> 184 
> 185 wrapper.func = self.func
> --> 186 wrapper.returnType = self.returnType
> 187 wrapper.evalType = self.evalType
> 188 wrapper.deterministic = self.deterministic
> C:\opt\spark\spark-2.3.0-bin-hadoop2.7\python\pyspark\sql\udf.py in 
> returnType(self)
> 117 raise NotImplementedError(
> 118 "Invalid returnType with scalar Pandas UDFs: %s 
> is "
> --> 119 "not supported" % 
> str(self._returnType_placeholder))
> 120 elif self.evalType == 
> PythonEvalType.SQL_GROUPED_MAP_PANDAS_UDF:
> 121 if isinstance(self._returnType_placeholder, StructType):
> NotImplementedError: Invalid returnType with scalar Pandas UDFs: 
> StructType(List(StructField(CarId,IntegerType,true),StructField(Distance,FloatType,true)))
>  is not supported{noformat}
> I've also tried changing the schema to
> {code:java}
> @pandas_udf("") {code}
> and
> {code:java}
> @pandas_udf("CarId:int,Distance:float"){code}
>  
> As mentioned, this is working on a DataBricks instance in Azure, but not 
> locally.






[jira] [Commented] (SPARK-23883) Error with conversion to arrow while using pandas_udf

2018-04-08 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23883?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16429940#comment-16429940
 ] 

Hyukjin Kwon commented on SPARK-23883:
--

Let's resolve it and open a new ticket. So basically you mean the output should 
be mapped by name and not by position, right?

> Error with conversion to arrow while using pandas_udf
> -
>
> Key: SPARK-23883
> URL: https://issues.apache.org/jira/browse/SPARK-23883
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.0
> Environment: Spark 2.3.0
> Python 3.5
> Java 1.8.0_161-b12
>Reporter: Omri
>Priority: Major
>
> Hi,
> I have code that works on Databricks but doesn't work on a local Spark 
> installation.
> This is the code I'm running:
> {code:java}
> from pyspark.sql.functions import pandas_udf
> import pandas as pd
> import numpy as np
> from pyspark.sql.types import *
> schema = StructType([
>   StructField("Distance", FloatType()),
>   StructField("CarId", IntegerType())
> ])
> def haversine(lon1, lat1, lon2, lat2):
> #Calculate distance, return scalar
> return 3.5 # Removed logic to facilitate reading
> @pandas_udf(schema)
> def totalDistance(oneCar):
> dist = haversine(oneCar.Longtitude.shift(1),
>  oneCar.Latitude.shift(1),
>  oneCar.loc[1:, 'Longitude'], 
>  oneCar.loc[1:, 'Latitude'])
> return 
> pd.DataFrame({"CarId":oneCar['CarId'].iloc[0],"Distance":np.sum(dist)},index 
> = [0])
> ## Calculate the overall distance made by each car
> distancePerCar= df.groupBy('CarId').apply(totalDistance)
> {code}
> I'm getting this exception, about Arrow not able to deal with this input:
> {noformat}
> ---
> TypeError Traceback (most recent call last)
> C:\opt\spark\spark-2.3.0-bin-hadoop2.7\python\pyspark\sql\udf.py in 
> returnType(self)
> 114 try:
> --> 115 to_arrow_type(self._returnType_placeholder)
> 116 except TypeError:
> C:\opt\spark\spark-2.3.0-bin-hadoop2.7\python\pyspark\sql\types.py in 
> to_arrow_type(dt)
>1641 else:
> -> 1642 raise TypeError("Unsupported type in conversion to Arrow: " + 
> str(dt))
>1643 return arrow_type
> TypeError: Unsupported type in conversion to Arrow: 
> StructType(List(StructField(CarId,IntegerType,true),StructField(Distance,FloatType,true)))
> During handling of the above exception, another exception occurred:
> NotImplementedError   Traceback (most recent call last)
>  in ()
>  18 km = 6367 * c
>  19 return km
> ---> 20 @pandas_udf("CarId: int, Distance: float")
>  21 def totalDistance(oneUser):
>  22 dist = haversine(oneUser.Longtitude.shift(1), 
> oneUser.Latitude.shift(1),
> C:\opt\spark\spark-2.3.0-bin-hadoop2.7\python\pyspark\sql\udf.py in 
> _create_udf(f, returnType, evalType)
>  62 udf_obj = UserDefinedFunction(
>  63 f, returnType=returnType, name=None, evalType=evalType, 
> deterministic=True)
> ---> 64 return udf_obj._wrapped()
>  65 
>  66 
> C:\opt\spark\spark-2.3.0-bin-hadoop2.7\python\pyspark\sql\udf.py in 
> _wrapped(self)
> 184 
> 185 wrapper.func = self.func
> --> 186 wrapper.returnType = self.returnType
> 187 wrapper.evalType = self.evalType
> 188 wrapper.deterministic = self.deterministic
> C:\opt\spark\spark-2.3.0-bin-hadoop2.7\python\pyspark\sql\udf.py in 
> returnType(self)
> 117 raise NotImplementedError(
> 118 "Invalid returnType with scalar Pandas UDFs: %s 
> is "
> --> 119 "not supported" % 
> str(self._returnType_placeholder))
> 120 elif self.evalType == 
> PythonEvalType.SQL_GROUPED_MAP_PANDAS_UDF:
> 121 if isinstance(self._returnType_placeholder, StructType):
> NotImplementedError: Invalid returnType with scalar Pandas UDFs: 
> StructType(List(StructField(CarId,IntegerType,true),StructField(Distance,FloatType,true)))
>  is not supported{noformat}
> I've also tried changing the schema to
> {code:java}
> @pandas_udf("") {code}
> and
> {code:java}
> @pandas_udf("CarId:int,Distance:float"){code}
>  
> As mentioned, this is working on a DataBricks instance in Azure, but not 
> locally.






[jira] [Commented] (SPARK-23897) Guava version

2018-04-08 Thread Herman van Hovell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23897?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16429873#comment-16429873
 ] 

Herman van Hovell commented on SPARK-23897:
---

That is not going to happen for a minor release, since people (unfortunately) 
rely on this dependency. There are plans to shade all dependencies in Spark 
3.0, but that is at least 6 months away.

> Guava version
> -
>
> Key: SPARK-23897
> URL: https://issues.apache.org/jira/browse/SPARK-23897
> Project: Spark
>  Issue Type: Dependency upgrade
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: Sercan Karaoglu
>Priority: Minor
>
> Guava dependency version 14 is pretty old and needs to be updated to at least 
> 16. The Google Cloud Storage connector uses a newer one, which causes a pretty 
> common Guava error, "java.lang.NoSuchMethodError: 
> com.google.common.base.Splitter.splitToList(Ljava/lang/CharSequence;)Ljava/util/List;",
>  and causes the app to crash.






[jira] [Commented] (SPARK-23897) Guava version

2018-04-08 Thread Sercan Karaoglu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23897?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16429868#comment-16429868
 ] 

Sercan Karaoglu commented on SPARK-23897:
-

What about shading it?

> Guava version
> -
>
> Key: SPARK-23897
> URL: https://issues.apache.org/jira/browse/SPARK-23897
> Project: Spark
>  Issue Type: Dependency upgrade
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: Sercan Karaoglu
>Priority: Minor
>
> Guava dependency version 14 is pretty old and needs to be updated to at least 
> 16. The Google Cloud Storage connector uses a newer one, which causes a pretty 
> common Guava error, "java.lang.NoSuchMethodError: 
> com.google.common.base.Splitter.splitToList(Ljava/lang/CharSequence;)Ljava/util/List;",
>  and causes the app to crash.






[jira] [Commented] (SPARK-23897) Guava version

2018-04-08 Thread Herman van Hovell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23897?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16429866#comment-16429866
 ] 

Herman van Hovell commented on SPARK-23897:
---

This is a duplicate of SPARK-23854.

We are not going to upgrade Guava any time soon. This is notoriously hard to do 
because it is used in a lot of Spark's dependencies, and the Guava developers 
aggressively remove deprecated APIs; updating can easily break stuff (missing 
methods, that sort of thing). See the discussion in 
[https://github.com/apache/spark/pull/20966] for some more context.

> Guava version
> -
>
> Key: SPARK-23897
> URL: https://issues.apache.org/jira/browse/SPARK-23897
> Project: Spark
>  Issue Type: Dependency upgrade
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: Sercan Karaoglu
>Priority: Minor
>
> Guava dependency version 14 is pretty old and needs to be updated to at least 
> 16. The Google Cloud Storage connector uses a newer one, which causes a pretty 
> common Guava error, "java.lang.NoSuchMethodError: 
> com.google.common.base.Splitter.splitToList(Ljava/lang/CharSequence;)Ljava/util/List;",
>  and causes the app to crash.






[jira] [Assigned] (SPARK-23898) Simplify code generation for Add/Subtract with CalendarIntervals

2018-04-08 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23898?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23898:


Assignee: Apache Spark  (was: Herman van Hovell)

> Simplify code generation for Add/Subtract with CalendarIntervals
> 
>
> Key: SPARK-23898
> URL: https://issues.apache.org/jira/browse/SPARK-23898
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Herman van Hovell
>Assignee: Apache Spark
>Priority: Major
>







[jira] [Assigned] (SPARK-23898) Simplify code generation for Add/Subtract with CalendarIntervals

2018-04-08 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23898?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23898:


Assignee: Herman van Hovell  (was: Apache Spark)

> Simplify code generation for Add/Subtract with CalendarIntervals
> 
>
> Key: SPARK-23898
> URL: https://issues.apache.org/jira/browse/SPARK-23898
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Herman van Hovell
>Assignee: Herman van Hovell
>Priority: Major
>







[jira] [Commented] (SPARK-23898) Simplify code generation for Add/Subtract with CalendarIntervals

2018-04-08 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23898?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16429865#comment-16429865
 ] 

Apache Spark commented on SPARK-23898:
--

User 'hvanhovell' has created a pull request for this issue:
https://github.com/apache/spark/pull/21005

> Simplify code generation for Add/Subtract with CalendarIntervals
> 
>
> Key: SPARK-23898
> URL: https://issues.apache.org/jira/browse/SPARK-23898
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Herman van Hovell
>Assignee: Herman van Hovell
>Priority: Major
>







[jira] [Commented] (SPARK-23883) Error with conversion to arrow while using pandas_udf

2018-04-08 Thread Omri (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23883?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16429864#comment-16429864
 ] 

Omri commented on SPARK-23883:
--

Yes it does. Thank you! I missed that part in the documentation.

I did find a weird behavior related to the order of the objects in the struct 
(if you wish, I can open a new issue on this).

When I define the schema as this one:

 
{code:java}
StructType([
  StructField("CarId", IntegerType()),
  StructField("Distance", FloatType())
])
{code}
It doesn't use the column names of the data frame returned by the pandas_udf, 
which results in a wrong assignment of types: CarId gets the float value and 
Distance gets cast to an integer.

 

Here's the result for example:
{code:java}
+-++
|CarId|Distance|
+-++
|3|29.0|
|3|65.0|
|3|   191.0|
|3|   222.0|
|3|19.0|
{code}
The pandas_udf returns 3.5, which gets truncated to 3.

When I replace the order of the struct into
{code:java}
schema = StructType([
  StructField("Distance", FloatType()),
  StructField("CarId", IntegerType())
])
{code}
I get this result:
{code:java}
++-+
|Distance|CarId|
++-+
| 3.5|   29|
| 3.5|   65|
| 3.5|  191|
| 3.5|  222|
| 3.5|   19|
{code}
I would assume that Spark would match the returned pandas data frame's column 
names with the StructField names.

 

Thanks again
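
A small workaround sketch, assuming the positional mapping described above: reorder the returned pandas DataFrame's columns to the schema's field order before returning, so each value lands in the intended field. The 3.5 distance is a placeholder, as in the original report.

{code:java}
# Workaround sketch: align the returned columns with the schema order,
# since the grouped-map result appears to be matched by position here.
import pandas as pd
from pyspark.sql.functions import pandas_udf, PandasUDFType
from pyspark.sql.types import StructType, StructField, IntegerType, FloatType

schema = StructType([
  StructField("CarId", IntegerType()),
  StructField("Distance", FloatType())
])

@pandas_udf(schema, PandasUDFType.GROUPED_MAP)
def totalDistance(oneCar):
    result = pd.DataFrame({"Distance": [3.5],                       # placeholder
                           "CarId": [int(oneCar["CarId"].iloc[0])]})
    # Reorder to the declared schema so positional assignment is correct.
    return result[[field.name for field in schema.fields]]

# distancePerCar = df.groupBy("CarId").apply(totalDistance)
{code}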

> Error with conversion to arrow while using pandas_udf
> -
>
> Key: SPARK-23883
> URL: https://issues.apache.org/jira/browse/SPARK-23883
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.0
> Environment: Spark 2.3.0
> Python 3.5
> Java 1.8.0_161-b12
>Reporter: Omri
>Priority: Major
>
> Hi,
> I have code that works on Databricks but doesn't work on a local Spark 
> installation.
> This is the code I'm running:
> {code:java}
> from pyspark.sql.functions import pandas_udf
> import pandas as pd
> import numpy as np
> from pyspark.sql.types import *
> schema = StructType([
>   StructField("Distance", FloatType()),
>   StructField("CarId", IntegerType())
> ])
> def haversine(lon1, lat1, lon2, lat2):
> #Calculate distance, return scalar
> return 3.5 # Removed logic to facilitate reading
> @pandas_udf(schema)
> def totalDistance(oneCar):
> dist = haversine(oneCar.Longtitude.shift(1),
>  oneCar.Latitude.shift(1),
>  oneCar.loc[1:, 'Longitude'], 
>  oneCar.loc[1:, 'Latitude'])
> return 
> pd.DataFrame({"CarId":oneCar['CarId'].iloc[0],"Distance":np.sum(dist)},index 
> = [0])
> ## Calculate the overall distance made by each car
> distancePerCar= df.groupBy('CarId').apply(totalDistance)
> {code}
> I'm getting this exception, about Arrow not able to deal with this input:
> {noformat}
> ---
> TypeError Traceback (most recent call last)
> C:\opt\spark\spark-2.3.0-bin-hadoop2.7\python\pyspark\sql\udf.py in 
> returnType(self)
> 114 try:
> --> 115 to_arrow_type(self._returnType_placeholder)
> 116 except TypeError:
> C:\opt\spark\spark-2.3.0-bin-hadoop2.7\python\pyspark\sql\types.py in 
> to_arrow_type(dt)
>1641 else:
> -> 1642 raise TypeError("Unsupported type in conversion to Arrow: " + 
> str(dt))
>1643 return arrow_type
> TypeError: Unsupported type in conversion to Arrow: 
> StructType(List(StructField(CarId,IntegerType,true),StructField(Distance,FloatType,true)))
> During handling of the above exception, another exception occurred:
> NotImplementedError   Traceback (most recent call last)
>  in ()
>  18 km = 6367 * c
>  19 return km
> ---> 20 @pandas_udf("CarId: int, Distance: float")
>  21 def totalDistance(oneUser):
>  22 dist = haversine(oneUser.Longtitude.shift(1), 
> oneUser.Latitude.shift(1),
> C:\opt\spark\spark-2.3.0-bin-hadoop2.7\python\pyspark\sql\udf.py in 
> _create_udf(f, returnType, evalType)
>  62 udf_obj = UserDefinedFunction(
>  63 f, returnType=returnType, name=None, evalType=evalType, 
> deterministic=True)
> ---> 64 return udf_obj._wrapped()
>  65 
>  66 
> C:\opt\spark\spark-2.3.0-bin-hadoop2.7\python\pyspark\sql\udf.py in 
> _wrapped(self)
> 184 
> 185 wrapper.func = self.func
> --> 186 wrapper.returnType = self.returnType
> 187 wrapper.evalType = self.evalType
> 188 wrapper.deterministic = self.deterministic
> C:\opt\spark\spark-2.3.0-bin-hadoop2.7\python\pyspark\sql\udf.py in 
> returnType(self)
> 117 raise NotImplementedError(
> 118 

[jira] [Created] (SPARK-23898) Simplify code generation for Add/Subtract with CalendarIntervals

2018-04-08 Thread Herman van Hovell (JIRA)
Herman van Hovell created SPARK-23898:
-

 Summary: Simplify code generation for Add/Subtract with 
CalendarIntervals
 Key: SPARK-23898
 URL: https://issues.apache.org/jira/browse/SPARK-23898
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.3.0
Reporter: Herman van Hovell
Assignee: Herman van Hovell









[jira] [Created] (SPARK-23897) Guava version

2018-04-08 Thread Sercan Karaoglu (JIRA)
Sercan Karaoglu created SPARK-23897:
---

 Summary: Guava version
 Key: SPARK-23897
 URL: https://issues.apache.org/jira/browse/SPARK-23897
 Project: Spark
  Issue Type: Dependency upgrade
  Components: Spark Core
Affects Versions: 2.3.0
Reporter: Sercan Karaoglu


Guava dependency version 14 is pretty old and needs to be updated to at least 16. 
The Google Cloud Storage connector uses a newer one, which causes a pretty common 
Guava error, "java.lang.NoSuchMethodError: 
com.google.common.base.Splitter.splitToList(Ljava/lang/CharSequence;)Ljava/util/List;",
 and causes the app to crash.






[jira] [Assigned] (SPARK-23896) Improve PartitioningAwareFileIndex

2018-04-08 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23896?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23896:


Assignee: Apache Spark

> Improve PartitioningAwareFileIndex
> --
>
> Key: SPARK-23896
> URL: https://issues.apache.org/jira/browse/SPARK-23896
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Gengliang Wang
>Assignee: Apache Spark
>Priority: Major
>
> Currently `PartitioningAwareFileIndex` accepts an optional parameter 
> `userPartitionSchema`. If provided, it will combine the inferred partition 
> schema with the parameter.
> However,
> 1. to get the inferred partition schema, we have to create a temporary file 
> index. 
> 2. to get `userPartitionSchema`, we need to  combine inferred partition 
> schema with `userSpecifiedSchema` 
> Only after that, a final version of `PartitioningAwareFileIndex` is created.
>  
> This can be improved by passing `userSpecifiedSchema` to 
> `PartitioningAwareFileIndex`.
> With the improvement, we can reduce redundant code and avoid parsing the file 
> partition twice. 
>  






[jira] [Commented] (SPARK-23896) Improve PartitioningAwareFileIndex

2018-04-08 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23896?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16429850#comment-16429850
 ] 

Apache Spark commented on SPARK-23896:
--

User 'gengliangwang' has created a pull request for this issue:
https://github.com/apache/spark/pull/21004

> Improve PartitioningAwareFileIndex
> --
>
> Key: SPARK-23896
> URL: https://issues.apache.org/jira/browse/SPARK-23896
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Gengliang Wang
>Priority: Major
>
> Currently `PartitioningAwareFileIndex` accepts an optional parameter 
> `userPartitionSchema`. If provided, it will combine the inferred partition 
> schema with the parameter.
> However,
> 1. to get the inferred partition schema, we have to create a temporary file 
> index. 
> 2. to get `userPartitionSchema`, we need to  combine inferred partition 
> schema with `userSpecifiedSchema` 
> Only after that, a final version of `PartitioningAwareFileIndex` is created.
>  
> This can be improved by passing `userSpecifiedSchema` to 
> `PartitioningAwareFileIndex`.
> With the improvement, we can reduce redundant code and avoid parsing the file 
> partition twice. 
>  






[jira] [Assigned] (SPARK-23896) Improve PartitioningAwareFileIndex

2018-04-08 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23896?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23896:


Assignee: (was: Apache Spark)

> Improve PartitioningAwareFileIndex
> --
>
> Key: SPARK-23896
> URL: https://issues.apache.org/jira/browse/SPARK-23896
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Gengliang Wang
>Priority: Major
>
> Currently `PartitioningAwareFileIndex` accepts an optional parameter 
> `userPartitionSchema`. If provided, it will combine the inferred partition 
> schema with the parameter.
> However,
> 1. to get the inferred partition schema, we have to create a temporary file 
> index. 
> 2. to get `userPartitionSchema`, we need to  combine inferred partition 
> schema with `userSpecifiedSchema` 
> Only after that, a final version of `PartitioningAwareFileIndex` is created.
>  
> This can be improved by passing `userSpecifiedSchema` to 
> `PartitioningAwareFileIndex`.
> With the improvement, we can reduce redundant code and avoid parsing the file 
> partition twice. 
>  






[jira] [Created] (SPARK-23896) Improve PartitioningAwareFileIndex

2018-04-08 Thread Gengliang Wang (JIRA)
Gengliang Wang created SPARK-23896:
--

 Summary: Improve PartitioningAwareFileIndex
 Key: SPARK-23896
 URL: https://issues.apache.org/jira/browse/SPARK-23896
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.3.1
Reporter: Gengliang Wang


Currently `PartitioningAwareFileIndex` accepts an optional parameter 
`userPartitionSchema`. If provided, it will combine the inferred partition 
schema with the parameter.

However,

1. to get the inferred partition schema, we have to create a temporary file 
index. 

2. to get `userPartitionSchema`, we need to  combine inferred partition schema 
with `userSpecifiedSchema` 

Only after that, a final version of `PartitioningAwareFileIndex` is created.

 

This can be improved by passing `userSpecifiedSchema` to 
`PartitioningAwareFileIndex`.

With the improvement, we can reduce redundant code and avoid parsing the file 
partition twice. 

 






[jira] [Resolved] (SPARK-23893) Possible overflow in long = int * int

2018-04-08 Thread Herman van Hovell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Herman van Hovell resolved SPARK-23893.
---
   Resolution: Fixed
 Assignee: Kazuaki Ishizaki
Fix Version/s: 2.4.0

> Possible overflow in long = int * int
> -
>
> Key: SPARK-23893
> URL: https://issues.apache.org/jira/browse/SPARK-23893
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 2.3.0
>Reporter: Kazuaki Ishizaki
>Assignee: Kazuaki Ishizaki
>Priority: Minor
> Fix For: 2.4.0
>
>
> To perform `int * int` and then to cast to `long` may cause overflow if the 
> MSB of the multiplication result is `1`. In other words, the result would be 
> negative due to sign extension.
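
A hedged illustration of the failure mode, simulated in Python with an explicit 32-bit reinterpretation (Python ints themselves do not overflow):

{code:java}
# Multiplying two Java ints and only then widening to long keeps the 32-bit
# wraparound; the sign bit of the truncated product makes it negative.
def as_java_int(x):
    # Truncate to 32 bits and reinterpret as a signed two's-complement int.
    x &= 0xFFFFFFFF
    return x - (1 << 32) if x >= (1 << 31) else x

a, b = 100000, 30000
print(a * b)               # 3000000000  (the intended long value)
print(as_java_int(a * b))  # -1294967296 (what `(long)(a * b)` yields in Java)

# Safe pattern: widen an operand before multiplying, i.e. `(long) a * b`.
{code}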






[jira] [Resolved] (SPARK-23892) Improve coverage and fix lint error in UTF8String-related Suite

2018-04-08 Thread Herman van Hovell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23892?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Herman van Hovell resolved SPARK-23892.
---
   Resolution: Fixed
 Assignee: Kazuaki Ishizaki
Fix Version/s: 2.4.0

> Improve coverage and fix lint error in UTF8String-related Suite
> ---
>
> Key: SPARK-23892
> URL: https://issues.apache.org/jira/browse/SPARK-23892
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: Kazuaki Ishizaki
>Assignee: Kazuaki Ishizaki
>Priority: Minor
> Fix For: 2.4.0
>
>
> The following code in {{UTF8StringSuite}} makes no sense (the assertions are trivially true).
> {code}
> assertTrue(s1.startsWith(s1));
> assertTrue(s1.endsWith(s1));
> {code}
> The code {{if (length <= 0) ""}} in {{UTF8StringPropertyCheckSuite}} is dead 
> code, since that case has already returned earlier:
> {code}
>   test("lpad, rpad") {
> def padding(origin: String, pad: String, length: Int, isLPad: Boolean): 
> String = {
>   if (length <= 0) return ""
>   if (length <= origin.length) {
> if (length <= 0) "" else origin.substring(0, length)
>   } else {
>...
> {code}
> The previous change in {{UTF8StringSuite}} broke lint-java check.






[jira] [Updated] (SPARK-21337) SQL which has large ‘case when’ expressions may cause code generation beyond 64KB

2018-04-08 Thread fengchaoge (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21337?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

fengchaoge updated SPARK-21337:
---
Attachment: (was: login.controller.js)

> SQL which has large ‘case when’ expressions may cause code generation beyond 
> 64KB
> -
>
> Key: SPARK-21337
> URL: https://issues.apache.org/jira/browse/SPARK-21337
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.1.1
> Environment: spark-2.1.1-hadoop-2.6.0-cdh-5.4.2
>Reporter: fengchaoge
>Priority: Major
> Fix For: 2.1.1
>
> Attachments: test.JPG, test1.JPG, test2.JPG
>
>







[jira] [Updated] (SPARK-21337) SQL which has large ‘case when’ expressions may cause code generation beyond 64KB

2018-04-08 Thread fengchaoge (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21337?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

fengchaoge updated SPARK-21337:
---
Attachment: login.controller.js

> SQL which has large ‘case when’ expressions may cause code generation beyond 
> 64KB
> -
>
> Key: SPARK-21337
> URL: https://issues.apache.org/jira/browse/SPARK-21337
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.1.1
> Environment: spark-2.1.1-hadoop-2.6.0-cdh-5.4.2
>Reporter: fengchaoge
>Priority: Major
> Fix For: 2.1.1
>
> Attachments: login.controller.js, test.JPG, test1.JPG, test2.JPG
>
>







[jira] [Created] (SPARK-23895) Job continues to run even though some tasks have failed

2018-04-08 Thread Huiqiang Liu (JIRA)
Huiqiang Liu created SPARK-23895:


 Summary: Job continues to run even though some tasks have failed
 Key: SPARK-23895
 URL: https://issues.apache.org/jira/browse/SPARK-23895
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.6.0
 Environment: Ubuntu 14.04.4 LTS

Spark standalone deployment
Reporter: Huiqiang Liu


We are using com.databricks.spark-redshift to write a dataframe into a Redshift 
table. Internally, it writes the dataframe to S3 first and then executes a query 
to load the data from S3 into the final Redshift table. The problem occurred in 
the S3 writing phase: one executor went down due to a JVM issue, but the whole 
job was still considered a success. It continued to run the query that loads the 
incomplete data from S3 into Redshift, which resulted in data loss.

The executor log:

{{8/04/01 15:06:25 INFO MemoryStore: Block broadcast_664 stored as values in 
memory (estimated size 114.6 KB, free 63.3 MB)}}
{{18/04/01 15:06:25 INFO MapOutputTrackerWorker: Don't have map outputs for 
shuffle 11, fetching them}}
{{18/04/01 15:06:25 INFO MapOutputTrackerWorker: Doing the fetch; tracker 
endpoint = NettyRpcEndpointRef(spark://MapOutputTracker@172.19.103.84:43248)}}
{{18/04/01 15:06:25 INFO MapOutputTrackerWorker: Got the output locations}}
{{18/04/01 15:06:25 INFO ShuffleBlockFetcherIterator: Getting 200 non-empty 
blocks out of 200 blocks}}
{{18/04/01 15:06:25 INFO ShuffleBlockFetcherIterator: Started 15 remote fetches 
in 4 ms}}
{{18/04/01 15:06:25 INFO DefaultWriterContainer: Using output committer class 
org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter}}
{{Java HotSpot(TM) 64-Bit Server VM warning: INFO: 
os::commit_memory(0x7f823e54e000, 65536, 1) failed; error='Cannot allocate 
memory' (errno=12)}}

 

The driver log:

{{18/04/01 15:06:39 INFO DAGScheduler: ShuffleMapStage 29 (mapPartitions at 
RedshiftWriter.scala:237) finished in 1.579 s}}
{{18/04/01 15:06:39 INFO DAGScheduler: looking for newly runnable stages}}
{{18/04/01 15:06:39 INFO DAGScheduler: running: Set()}}
{{18/04/01 15:06:39 INFO DAGScheduler: waiting: Set(ResultStage 30)}}
{{18/04/01 15:06:39 INFO DAGScheduler: failed: Set()}}
{{18/04/01 15:06:39 INFO DAGScheduler: Submitting ResultStage 30 
(MapPartitionsRDD[749] at createDataFrame at RedshiftWriter.scala:275), which 
has no missing parents}}
{{18/04/01 15:06:39 INFO MemoryStore: Block broadcast_667 stored as values in 
memory (estimated size 114.6 KB, free 160.1 MB)}}
{{18/04/01 15:06:39 INFO MemoryStore: Block broadcast_667_piece0 stored as 
bytes in memory (estimated size 44.1 KB, free 160.2 MB)}}
{{18/04/01 15:06:39 INFO BlockManagerInfo: Added broadcast_667_piece0 in memory 
on 172.19.103.84:18128 (size: 44.1 KB, free: 492.3 MB)}}
{{18/04/01 15:06:39 INFO SparkContext: Created broadcast 667 from broadcast at 
DAGScheduler.scala:1006}}
{{18/04/01 15:06:39 INFO DAGScheduler: Submitting 5 missing tasks from 
ResultStage 30 (MapPartitionsRDD[749] at createDataFrame at 
RedshiftWriter.scala:275)}}
{{18/04/01 15:06:39 INFO TaskSchedulerImpl: Adding task set 30.1 with 5 tasks}}
{{18/04/01 15:06:39 INFO TaskSetManager: Starting task 0.0 in stage 30.1 (TID 
5529, ip-172-19-103-87.ec2.internal, partition 3,PROCESS_LOCAL, 2061 bytes)}}
{{18/04/01 15:06:39 INFO TaskSetManager: Starting task 1.0 in stage 30.1 (TID 
5530, ip-172-19-105-221.ec2.internal, partition 6,PROCESS_LOCAL, 2061 bytes)}}
{{18/04/01 15:06:39 INFO TaskSetManager: Starting task 2.0 in stage 30.1 (TID 
5531, ip-172-19-101-76.ec2.internal, partition 11,PROCESS_LOCAL, 2061 bytes)}}
{{18/04/01 15:06:39 INFO TaskSetManager: Starting task 3.0 in stage 30.1 (TID 
5532, ip-172-19-103-87.ec2.internal, partition 13,PROCESS_LOCAL, 2061 bytes)}}
{{18/04/01 15:06:39 INFO TaskSetManager: Starting task 4.0 in stage 30.1 (TID 
5533, ip-172-19-105-117.ec2.internal, partition 14,PROCESS_LOCAL, 2061 bytes)}}
{{18/04/01 15:06:39 INFO BlockManagerInfo: Added broadcast_667_piece0 in memory 
on ip-172-19-101-76.ec2.internal:16864 (size: 44.1 KB, free: 1928.8 MB)}}
{{18/04/01 15:06:39 INFO BlockManagerInfo: Added broadcast_667_piece0 in memory 
on ip-172-19-103-87.ec2.internal:62681 (size: 44.1 KB, free: 1929.9 MB)}}
{{18/04/01 15:06:39 INFO BlockManagerInfo: Added broadcast_667_piece0 in memory 
on ip-172-19-105-221.ec2.internal:52999 (size: 44.1 KB, free: 1937.7 MB)}}
{{18/04/01 15:06:39 INFO BlockManagerInfo: Added broadcast_667_piece0 in memory 
on ip-172-19-105-117.ec2.internal:45766 (size: 44.1 KB, free: 1929.6 MB)}}
{{18/04/01 15:06:39 INFO BlockManagerInfo: Added broadcast_667_piece0 in memory 
on ip-172-19-103-87.ec2.internal:13372 (size: 44.1 KB, free: 1974.1 MB)}}
{{18/04/01 15:06:39 INFO MapOutputTrackerMasterEndpoint: Asked to send map 
output locations for shuffle 11 to ip-172-19-101-76.ec2.internal:44931}}
{{18/04/01 15:06:39 INFO MapOutputTrackerMaster: Size of output