[jira] [Updated] (SPARK-22431) Creating Permanent view with illegal type

2017-11-22 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22431?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-22431:

Target Version/s: 2.3.0

> Creating Permanent view with illegal type
> -
>
> Key: SPARK-22431
> URL: https://issues.apache.org/jira/browse/SPARK-22431
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Herman van Hovell
>
> It is possible in Spark SQL to create a permanent view that uses a nested 
> field with an illegal name.
> For example, if we create the following view:
> {noformat}
> create view x as select struct('a' as `$q`, 1 as b) q
> {noformat}
> A simple select fails with the following exception:
> {noformat}
> select * from x;
> org.apache.spark.SparkException: Cannot recognize hive type string: 
> struct<$q:string,b:int>
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl$.fromHiveColumn(HiveClientImpl.scala:812)
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1$$anonfun$apply$11$$anonfun$7.apply(HiveClientImpl.scala:378)
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1$$anonfun$apply$11$$anonfun$7.apply(HiveClientImpl.scala:378)
> ...
> {noformat}
> Dropping the view isn't possible either:
> {noformat}
> drop view x;
> org.apache.spark.SparkException: Cannot recognize hive type string: 
> struct<$q:string,b:int>
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl$.fromHiveColumn(HiveClientImpl.scala:812)
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1$$anonfun$apply$11$$anonfun$7.apply(HiveClientImpl.scala:378)
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1$$anonfun$apply$11$$anonfun$7.apply(HiveClientImpl.scala:378)
> ...
> {noformat}
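A minimal workaround sketch (editorial addition, not part of the original report; it assumes renaming the nested field is acceptable): give the struct field a name that Hive's type-string parser accepts, so the view's schema can be stored and read back.

{code}
// Hypothetical workaround sketch: avoid characters Hive cannot parse in nested field names.
spark.sql("create view x_ok as select struct('a' as q_dollar, 1 as b) q")
spark.sql("select * from x_ok").show()
{code}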






[jira] [Updated] (SPARK-22588) SPARK: Load Data from Dataframe or RDD to DynamoDB / dealing with null values

2017-11-22 Thread Saanvi Sharma (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22588?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saanvi Sharma updated SPARK-22588:
--
Description: 
I am using Spark 2.1 on EMR and I have a DataFrame like this:

 ClientNum | Value_1 | Value_2 | Value_3 | Value_4
 14        | A       | B       | C       | null
 19        | X       | Y       | null    | null
 21        | R       | null    | null    | null

I want to load this data into a DynamoDB table with ClientNum as the key, following:

Analyze Your Data on Amazon DynamoDB with Apache Spark

Using Spark SQL for ETL

Here is the code I tried:

  // Imports assumed to come from Hadoop, the AWS SDK and the EMR DynamoDB connector (added for completeness).
  import org.apache.hadoop.mapred.JobConf
  import org.apache.hadoop.io.Text
  import org.apache.hadoop.dynamodb.DynamoDBItemWritable
  import com.amazonaws.services.dynamodbv2.model.AttributeValue
  import java.util.HashMap

  var jobConf = new JobConf(sc.hadoopConfiguration)
  jobConf.set("dynamodb.servicename", "dynamodb")
  jobConf.set("dynamodb.input.tableName", "table_name")
  jobConf.set("dynamodb.output.tableName", "table_name")
  jobConf.set("dynamodb.endpoint", "dynamodb.eu-west-1.amazonaws.com")
  jobConf.set("dynamodb.regionid", "eu-west-1")
  jobConf.set("dynamodb.throughput.read", "1")
  jobConf.set("dynamodb.throughput.read.percent", "1")
  jobConf.set("dynamodb.throughput.write", "1")
  jobConf.set("dynamodb.throughput.write.percent", "1")

  jobConf.set("mapred.output.format.class", "org.apache.hadoop.dynamodb.write.DynamoDBOutputFormat")
  jobConf.set("mapred.input.format.class", "org.apache.hadoop.dynamodb.read.DynamoDBInputFormat")

  // Import the data
  val df = sqlContext.read.format("com.databricks.spark.csv")
    .option("header", "true").option("inferSchema", "true").load(path)
I performed a transformation to have an RDD that matches the types that the 
DynamoDB custom output format knows how to write. The custom output format 
expects a tuple containing the Text and DynamoDBItemWritable types.

Create a new RDD with those types in it, in the following map call:

  // Convert the DataFrame to an RDD
  val df_rdd = df.rdd
  > df_rdd: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = MapPartitionsRDD[10] at rdd at <console>:41

  // Print the first row of the RDD
  df_rdd.take(1)
  > res12: Array[org.apache.spark.sql.Row] = Array([14,A,B,C,null])

  var ddbInsertFormattedRDD = df_rdd.map(a => {
  var ddbMap = new HashMap[String, AttributeValue]()

  var ClientNum = new AttributeValue()
  ClientNum.setN(a.get(0).toString)
  ddbMap.put("ClientNum", ClientNum)

  var Value_1 = new AttributeValue()
  Value_1.setS(a.get(1).toString)
  ddbMap.put("Value_1", Value_1)

  var Value_2 = new AttributeValue()
  Value_2.setS(a.get(2).toString)
  ddbMap.put("Value_2", Value_2)

  var Value_3 = new AttributeValue()
  Value_3.setS(a.get(3).toString)
  ddbMap.put("Value_3", Value_3)

  var Value_4 = new AttributeValue()
  Value_4.setS(a.get(4).toString)
  ddbMap.put("Value_4", Value_4)

  var item = new DynamoDBItemWritable()
  item.setItem(ddbMap)

  (new Text(""), item)
  })
This last call uses the job configuration that defines the EMR-DDB connector to 
write out the new RDD you created in the expected format:

ddbInsertFormattedRDD.saveAsHadoopDataset(jobConf)
fails with the following error:

Caused by: java.lang.NullPointerException
The null values cause the error; if I try with only ClientNum and Value_1, it 
works and the data is correctly inserted into the DynamoDB table.

Thanks for your help !!
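A hedged sketch of one way around the NPE (added for illustration, not part of the original report; it assumes that simply omitting null columns from the DynamoDB item is acceptable): only set an attribute when the source column is non-null, since AttributeValue cannot carry a null string.

{code}
// Hypothetical null-safe variant of the mapping above (sketch, not a verified fix).
// Assumes the same imports and df_rdd as in the snippet above.
val ddbInsertFormattedRDD = df_rdd.map { a =>
  val ddbMap = new HashMap[String, AttributeValue]()

  val clientNum = new AttributeValue()
  clientNum.setN(a.get(0).toString)
  ddbMap.put("ClientNum", clientNum)

  // Skip columns that are null instead of calling toString on them.
  Seq("Value_1", "Value_2", "Value_3", "Value_4").zipWithIndex.foreach { case (name, i) =>
    val v = a.get(i + 1)
    if (v != null) {
      val attr = new AttributeValue()
      attr.setS(v.toString)
      ddbMap.put(name, attr)
    }
  }

  val item = new DynamoDBItemWritable()
  item.setItem(ddbMap)
  (new Text(""), item)
}
{code}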

  was:
I am using spark 2.1 on EMR and i have a dataframe like this:

 ClientNum  | Value_1  | Value_2 | Value_3  | Value_4
 14 |A |B|   C  |   null
 19 |X |Y|  null|   null
 21 |R |   null  |  null|   null
I want to load data into DynamoDB table with ClientNum as key fetching:

Analyze Your Data on Amazon DynamoDB with [https://mindmajix.com/scala-training 
apche] Spark11

Using Spark SQL for ETL3

here is my code that I tried to solve:

  var jobConf = new JobConf(sc.hadoopConfiguration)
  jobConf.set("dynamodb.servicename", "dynamodb")
  jobConf.set("dynamodb.input.tableName", "table_name")   
  jobConf.set("dynamodb.output.tableName", "table_name")   
  jobConf.set("dynamodb.endpoint", "dynamodb.eu-west-1.amazonaws.com")
  jobConf.set("dynamodb.regionid", "eu-west-1")
  jobConf.set("dynamodb.throughput.read", "1")
  jobConf.set("dynamodb.throughput.read.percent", "1")
  jobConf.set("dynamodb.throughput.write", "1")
  jobConf.set("dynamodb.throughput.write.percent", "1")
  
  jobConf.set("mapred.output.format.class", 
"org.apache.hadoop.dynamodb.write.DynamoDBOutputFormat")
  jobConf.set("mapred.input.format.class", 
"org.apache.hadoop.dynamodb.read.DynamoDBInputFormat")

  #Import Data
  val df = sqlContext.read.format("com.databricks.spark.csv").option("header", 
"true").option("inferSchema", "true").load(path)
I performed a transformation to have an RDD that matches the types that the 
DynamoDB custom output format knows how to write. The custom output format 
expects a tuple containing the Text and DynamoDBItemWritable types.

Create a new RDD with those types in it, in the following map call:

  #Convert the 

[jira] [Created] (SPARK-22588) SPARK: Load Data from Dataframe or RDD to DynamoDB / dealing with null values

2017-11-22 Thread Saanvi Sharma (JIRA)
Saanvi Sharma created SPARK-22588:
-

 Summary: SPARK: Load Data from Dataframe or RDD to DynamoDB / 
dealing with null values
 Key: SPARK-22588
 URL: https://issues.apache.org/jira/browse/SPARK-22588
 Project: Spark
  Issue Type: Question
  Components: Deploy
Affects Versions: 2.1.1
Reporter: Saanvi Sharma
Priority: Minor


I am using Spark 2.1 on EMR and I have a DataFrame like this:

 ClientNum | Value_1 | Value_2 | Value_3 | Value_4
 14        | A       | B       | C       | null
 19        | X       | Y       | null    | null
 21        | R       | null    | null    | null

I want to load this data into a DynamoDB table with ClientNum as the key, following:

Analyze Your Data on Amazon DynamoDB with [https://mindmajix.com/scala-training] Apache Spark

Using Spark SQL for ETL

Here is the code I tried:

  // Imports assumed to come from Hadoop, the AWS SDK and the EMR DynamoDB connector (added for completeness).
  import org.apache.hadoop.mapred.JobConf
  import org.apache.hadoop.io.Text
  import org.apache.hadoop.dynamodb.DynamoDBItemWritable
  import com.amazonaws.services.dynamodbv2.model.AttributeValue
  import java.util.HashMap

  var jobConf = new JobConf(sc.hadoopConfiguration)
  jobConf.set("dynamodb.servicename", "dynamodb")
  jobConf.set("dynamodb.input.tableName", "table_name")
  jobConf.set("dynamodb.output.tableName", "table_name")
  jobConf.set("dynamodb.endpoint", "dynamodb.eu-west-1.amazonaws.com")
  jobConf.set("dynamodb.regionid", "eu-west-1")
  jobConf.set("dynamodb.throughput.read", "1")
  jobConf.set("dynamodb.throughput.read.percent", "1")
  jobConf.set("dynamodb.throughput.write", "1")
  jobConf.set("dynamodb.throughput.write.percent", "1")

  jobConf.set("mapred.output.format.class", "org.apache.hadoop.dynamodb.write.DynamoDBOutputFormat")
  jobConf.set("mapred.input.format.class", "org.apache.hadoop.dynamodb.read.DynamoDBInputFormat")

  // Import the data
  val df = sqlContext.read.format("com.databricks.spark.csv")
    .option("header", "true").option("inferSchema", "true").load(path)
I performed a transformation to have an RDD that matches the types that the 
DynamoDB custom output format knows how to write. The custom output format 
expects a tuple containing the Text and DynamoDBItemWritable types.

Create a new RDD with those types in it, in the following map call:

  // Convert the DataFrame to an RDD
  val df_rdd = df.rdd
  > df_rdd: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = MapPartitionsRDD[10] at rdd at <console>:41

  // Print the first row of the RDD
  df_rdd.take(1)
  > res12: Array[org.apache.spark.sql.Row] = Array([14,A,B,C,null])

  var ddbInsertFormattedRDD = df_rdd.map(a => {
  var ddbMap = new HashMap[String, AttributeValue]()

  var ClientNum = new AttributeValue()
  ClientNum.setN(a.get(0).toString)
  ddbMap.put("ClientNum", ClientNum)

  var Value_1 = new AttributeValue()
  Value_1.setS(a.get(1).toString)
  ddbMap.put("Value_1", Value_1)

  var Value_2 = new AttributeValue()
  Value_2.setS(a.get(2).toString)
  ddbMap.put("Value_2", Value_2)

  var Value_3 = new AttributeValue()
  Value_3.setS(a.get(3).toString)
  ddbMap.put("Value_3", Value_3)

  var Value_4 = new AttributeValue()
  Value_4.setS(a.get(4).toString)
  ddbMap.put("Value_4", Value_4)

  var item = new DynamoDBItemWritable()
  item.setItem(ddbMap)

  (new Text(""), item)
  })
This last call uses the job configuration that defines the EMR-DDB connector to 
write out the new RDD you created in the expected format:

ddbInsertFormattedRDD.saveAsHadoopDataset(jobConf)
fails with the following error:

Caused by: java.lang.NullPointerException
The null values cause the error; if I try with only ClientNum and Value_1, it 
works and the data is correctly inserted into the DynamoDB table.

Thanks for your help !!






[jira] [Comment Edited] (SPARK-10848) Applied JSON Schema Works for json RDD but not when loading json file

2017-11-22 Thread Amit (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10848?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16263867#comment-16263867
 ] 

Amit edited comment on SPARK-10848 at 11/23/17 6:22 AM:


This issue still persists in Spark 2.1.0.
I tried the steps below in Spark 2.1.0 and it gives the same result as in the 
question. Please reopen the JIRA so it can be tracked.


{code:java}
import  org.apache.spark.sql.types._

{code}


{code:java}
val jsonRdd = sc.parallelize(List(
  """{"OrderID": 1, "CustomerID":452 , "OrderDate": "2015-05-16", 
"ProductCode": "WQT648", "Qty": 5}""",
  """{"OrderID": 2, "CustomerID":16  , "OrderDate": "2015-07-11", 
"ProductCode": "LG4-Z5", "Qty": 10, "Discount":0.25, 
"expressDelivery":true}"""))

{code}


{code:java}
val mySchema = StructType(Array(
  StructField(name="OrderID"   , dataType=LongType, nullable=false),
  StructField("CustomerID", IntegerType, false),
  StructField("OrderDate", DateType, false),
  StructField("ProductCode", StringType, false),
  StructField("Qty", IntegerType, false),
  StructField("Discount", FloatType, true),
  StructField("expressDelivery", BooleanType, true)
))

val myDF = spark.read.schema(mySchema).json(jsonRdd)
val schema1 = myDF.printSchema

val dfDFfromFile = spark.read.schema(mySchema).json("csvdatatest/Orders.json")
val schema2 = dfDFfromFile.printSchema
{code}
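A hedged follow-up sketch (editorial addition, not part of the original comment): if the goal is only to keep the declared nullability on the file-based DataFrame, re-applying the schema explicitly preserves it, because createDataFrame takes the supplied schema as-is.

{code:java}
// Hypothetical workaround sketch: re-apply the declared schema so nullable=false is kept.
// Assumes dfDFfromFile and mySchema from the snippet above.
val dfWithDeclaredSchema = spark.createDataFrame(dfDFfromFile.rdd, mySchema)
dfWithDeclaredSchema.printSchema()
{code}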



was (Author: amit1990):
This issue is still persistent in Spark 2.1.0.
I tried below steps and in Spark 2.1.0, it giving the same result as in the 
question, Please reopen the JIRA to get it tracked.

import  org.apache.spark.sql.types._


{code:java}
val jsonRdd = sc.parallelize(List(
  """{"OrderID": 1, "CustomerID":452 , "OrderDate": "2015-05-16", 
"ProductCode": "WQT648", "Qty": 5}""",
  """{"OrderID": 2, "CustomerID":16  , "OrderDate": "2015-07-11", 
"ProductCode": "LG4-Z5", "Qty": 10, "Discount":0.25, 
"expressDelivery":true}"""))

{code}


{code:java}
val mySchema = StructType(Array(
  StructField(name="OrderID"   , dataType=LongType, nullable=false),
  StructField("CustomerID", IntegerType, false),
  StructField("OrderDate", DateType, false),
  StructField("ProductCode", StringType, false),
  StructField("Qty", IntegerType, false),
  StructField("Discount", FloatType, true),
  StructField("expressDelivery", BooleanType, true)
))

val myDF = spark.read.schema(mySchema).json(jsonRdd)
val schema1 = myDF.printSchema

val dfDFfromFile = spark.read.schema(mySchema).json("csvdatatest/Orders.json")
val schema2 = dfDFfromFile.printSchema
{code}


> Applied JSON Schema Works for json RDD but not when loading json file
> -
>
> Key: SPARK-10848
> URL: https://issues.apache.org/jira/browse/SPARK-10848
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Miklos Christine
>Priority: Minor
>
> Using a defined schema to load a json rdd works as expected. Loading the json 
> records from a file does not apply the supplied schema. Mainly the nullable 
> field isn't applied correctly. Loading from a file uses nullable=true on all 
> fields regardless of applied schema. 
> Code to reproduce:
> {code}
> import  org.apache.spark.sql.types._
> val jsonRdd = sc.parallelize(List(
>   """{"OrderID": 1, "CustomerID":452 , "OrderDate": "2015-05-16", 
> "ProductCode": "WQT648", "Qty": 5}""",
>   """{"OrderID": 2, "CustomerID":16  , "OrderDate": "2015-07-11", 
> "ProductCode": "LG4-Z5", "Qty": 10, "Discount":0.25, 
> "expressDelivery":true}"""))
> val mySchema = StructType(Array(
>   StructField(name="OrderID"   , dataType=LongType, nullable=false),
>   StructField("CustomerID", IntegerType, false),
>   StructField("OrderDate", DateType, false),
>   StructField("ProductCode", StringType, false),
>   StructField("Qty", IntegerType, false),
>   StructField("Discount", FloatType, true),
>   StructField("expressDelivery", BooleanType, true)
> ))
> val myDF = sqlContext.read.schema(mySchema).json(jsonRdd)
> val schema1 = myDF.printSchema
> val dfDFfromFile = sqlContext.read.schema(mySchema).json("Orders.json")
> val schema2 = dfDFfromFile.printSchema
> {code}
> Orders.json
> {code}
> {"OrderID": 1, "CustomerID":452 , "OrderDate": "2015-05-16", "ProductCode": 
> "WQT648", "Qty": 5}
> {"OrderID": 2, "CustomerID":16  , "OrderDate": "2015-07-11", "ProductCode": 
> "LG4-Z5", "Qty": 10, "Discount":0.25, "expressDelivery":true}
> {code}
> The behavior should be consistent. 






[jira] [Commented] (SPARK-10848) Applied JSON Schema Works for json RDD but not when loading json file

2017-11-22 Thread Amit (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10848?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16263867#comment-16263867
 ] 

Amit commented on SPARK-10848:
--

This issue still persists in Spark 2.1.0.
I tried the steps below in Spark 2.1.0 and it gives the same result as in the 
question. Please reopen the JIRA so it can be tracked.

import  org.apache.spark.sql.types._

val jsonRdd = sc.parallelize(List(
  """{"OrderID": 1, "CustomerID":452 , "OrderDate": "2015-05-16", 
"ProductCode": "WQT648", "Qty": 5}""",
  """{"OrderID": 2, "CustomerID":16  , "OrderDate": "2015-07-11", 
"ProductCode": "LG4-Z5", "Qty": 10, "Discount":0.25, 
"expressDelivery":true}"""))

val mySchema = StructType(Array(
  StructField(name="OrderID"   , dataType=LongType, nullable=false),
  StructField("CustomerID", IntegerType, false),
  StructField("OrderDate", DateType, false),
  StructField("ProductCode", StringType, false),
  StructField("Qty", IntegerType, false),
  StructField("Discount", FloatType, true),
  StructField("expressDelivery", BooleanType, true)
))

val myDF = spark.read.schema(mySchema).json(jsonRdd)
val schema1 = myDF.printSchema

val dfDFfromFile = spark.read.schema(mySchema).json("csvdatatest/Orders.json")
val schema2 = dfDFfromFile.printSchema

> Applied JSON Schema Works for json RDD but not when loading json file
> -
>
> Key: SPARK-10848
> URL: https://issues.apache.org/jira/browse/SPARK-10848
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Miklos Christine
>Priority: Minor
>
> Using a defined schema to load a json rdd works as expected. Loading the json 
> records from a file does not apply the supplied schema. Mainly the nullable 
> field isn't applied correctly. Loading from a file uses nullable=true on all 
> fields regardless of applied schema. 
> Code to reproduce:
> {code}
> import  org.apache.spark.sql.types._
> val jsonRdd = sc.parallelize(List(
>   """{"OrderID": 1, "CustomerID":452 , "OrderDate": "2015-05-16", 
> "ProductCode": "WQT648", "Qty": 5}""",
>   """{"OrderID": 2, "CustomerID":16  , "OrderDate": "2015-07-11", 
> "ProductCode": "LG4-Z5", "Qty": 10, "Discount":0.25, 
> "expressDelivery":true}"""))
> val mySchema = StructType(Array(
>   StructField(name="OrderID"   , dataType=LongType, nullable=false),
>   StructField("CustomerID", IntegerType, false),
>   StructField("OrderDate", DateType, false),
>   StructField("ProductCode", StringType, false),
>   StructField("Qty", IntegerType, false),
>   StructField("Discount", FloatType, true),
>   StructField("expressDelivery", BooleanType, true)
> ))
> val myDF = sqlContext.read.schema(mySchema).json(jsonRdd)
> val schema1 = myDF.printSchema
> val dfDFfromFile = sqlContext.read.schema(mySchema).json("Orders.json")
> val schema2 = dfDFfromFile.printSchema
> {code}
> Orders.json
> {code}
> {"OrderID": 1, "CustomerID":452 , "OrderDate": "2015-05-16", "ProductCode": 
> "WQT648", "Qty": 5}
> {"OrderID": 2, "CustomerID":16  , "OrderDate": "2015-07-11", "ProductCode": 
> "LG4-Z5", "Qty": 10, "Discount":0.25, "expressDelivery":true}
> {code}
> The behavior should be consistent. 






[jira] [Created] (SPARK-22587) Spark job fails if fs.defaultFS and application jar are different url

2017-11-22 Thread Prabhu Joseph (JIRA)
Prabhu Joseph created SPARK-22587:
-

 Summary: Spark job fails if fs.defaultFS and application jar are 
different url
 Key: SPARK-22587
 URL: https://issues.apache.org/jira/browse/SPARK-22587
 Project: Spark
  Issue Type: Bug
  Components: Spark Submit
Affects Versions: 1.6.3
Reporter: Prabhu Joseph


A Spark job fails if fs.defaultFS and the URL where the application jar resides 
are different but have the same scheme:

spark-submit --conf spark.master=yarn-cluster wasb://XXX/tmp/test.py

In core-site.xml, fs.defaultFS is set to wasb:///YYY. Hadoop listing (hadoop fs 
-ls) works for both URLs, XXX and YYY.

{code}
Exception in thread "main" java.lang.IllegalArgumentException: Wrong FS: 
wasb://XXX/tmp/test.py, expected: wasb://YYY 
at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:665) 
at 
org.apache.hadoop.fs.azure.NativeAzureFileSystem.checkPath(NativeAzureFileSystem.java:1251)
 
at org.apache.hadoop.fs.FileSystem.makeQualified(FileSystem.java:485) 
at org.apache.spark.deploy.yarn.Client.copyFileToRemote(Client.scala:396) 
at 
org.apache.spark.deploy.yarn.Client.org$apache$spark$deploy$yarn$Client$$distribute$1(Client.scala:507)
 
at org.apache.spark.deploy.yarn.Client.prepareLocalResources(Client.scala:660) 
at 
org.apache.spark.deploy.yarn.Client.createContainerLaunchContext(Client.scala:912)
 
at org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:172) 
at org.apache.spark.deploy.yarn.Client.run(Client.scala:1248) 
at org.apache.spark.deploy.yarn.Client$.main(Client.scala:1307) 
at org.apache.spark.deploy.yarn.Client.main(Client.scala) 
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) 
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) 
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 
at java.lang.reflect.Method.invoke(Method.java:498) 
at 
org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:751)
 
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:187) 
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:212) 
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:126) 
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) 
{code}

The code in Client.copyFileToRemote tries to resolve the path of the application 
jar (XXX) from the FileSystem object created using the fs.defaultFS URL (YYY) 
instead of the actual URL of the application jar.

val destFs = destDir.getFileSystem(hadoopConf)
val srcFs = srcPath.getFileSystem(hadoopConf)

getFileSystem creates the filesystem based on the URL of the path, so this is 
fine. But the lines of code below try to qualify the srcPath (XXX URL) against 
the destFs (YYY URL), and so it fails.

var destPath = srcPath
val qualifiedDestPath = destFs.makeQualified(destPath)
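
A hedged sketch of the direction a fix could take (an assumption on my part, not the actual patch): qualify each path against its own filesystem, and only treat the two sides as the same filesystem when their URIs actually match.

{code}
// Hypothetical sketch only; the real fix may differ.
val destFs = destDir.getFileSystem(hadoopConf)
val srcFs = srcPath.getFileSystem(hadoopConf)
val qualifiedSrcPath = srcFs.makeQualified(srcPath)    // qualified against XXX's filesystem
val qualifiedDestDir = destFs.makeQualified(destDir)   // qualified against fs.defaultFS (YYY)
val sameFileSystem = srcFs.getUri == destFs.getUri     // a copy is only needed when this is false
{code}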









[jira] [Resolved] (SPARK-22495) Fix setup of SPARK_HOME variable on Windows

2017-11-22 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22495?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-22495.
--
Resolution: Fixed

Fixed in https://github.com/apache/spark/pull/19370 and needs a manual backport.

> Fix setup of SPARK_HOME variable on Windows
> ---
>
> Key: SPARK-22495
> URL: https://issues.apache.org/jira/browse/SPARK-22495
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Windows
>Affects Versions: 2.3.0
>Reporter: Hyukjin Kwon
>Assignee: Jakub Nowacki
>Priority: Minor
> Fix For: 2.3.0
>
>
> On Windows, pip-installed PySpark is unable to find the Spark home. There is 
> already a proposed change, with sufficient details and discussion, in 
> https://github.com/apache/spark/pull/19370 and SPARK-18136






[jira] [Updated] (SPARK-22495) Fix setup of SPARK_HOME variable on Windows

2017-11-22 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22495?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-22495:
-
Fix Version/s: 2.3.0

> Fix setup of SPARK_HOME variable on Windows
> ---
>
> Key: SPARK-22495
> URL: https://issues.apache.org/jira/browse/SPARK-22495
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Windows
>Affects Versions: 2.3.0
>Reporter: Hyukjin Kwon
>Assignee: Jakub Nowacki
>Priority: Minor
> Fix For: 2.3.0
>
>
> On Windows, pip-installed PySpark is unable to find the Spark home. There is 
> already a proposed change, with sufficient details and discussion, in 
> https://github.com/apache/spark/pull/19370 and SPARK-18136






[jira] [Commented] (SPARK-22573) SQL Planner is including unnecessary columns in the projection

2017-11-22 Thread Yuming Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22573?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16263745#comment-16263745
 ] 

Yuming Wang commented on SPARK-22573:
-

It is caused by https://github.com/apache/spark/pull/16954.
You can work around it with {{set 
spark.sql.constraintPropagation.enabled=false}}; I will try to fix it soon.
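
A short illustration of applying the suggested workaround in a Spark session (sketch only; the setting comes from the comment above):

{code:java}
// Disable constraint propagation for the current session (workaround, not a fix).
spark.conf.set("spark.sql.constraintPropagation.enabled", "false")
// or equivalently in SQL:
spark.sql("set spark.sql.constraintPropagation.enabled=false")
{code}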

> SQL Planner is including unnecessary columns in the projection
> --
>
> Key: SPARK-22573
> URL: https://issues.apache.org/jira/browse/SPARK-22573
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Rajkishore Hembram
>
> While I was running TPC-H query 18 for benchmarking, I observed that the 
> query plan for Apache Spark 2.2.0 is less efficient than in other versions of 
> Apache Spark. The other versions of Apache Spark (2.0.2 and 2.1.2) include 
> only the required columns in the projections, but the query planner of Apache 
> Spark 2.2.0 includes unnecessary columns in the projection for some of the 
> queries, unnecessarily increasing the I/O. Because of that, Apache Spark 
> 2.2.0 takes more time.
> [Spark 2.1.2 TPC-H Query 18 
> Plan|https://drive.google.com/file/d/1_u8nPKG_SIM7P6fs0VK-8UEXIhWPY_BN/view]
> [Spark 2.2.0 TPC-H Query 18 
> Plan|https://drive.google.com/file/d/1xtxG5Ext36djfTDSdf_W5vGbbdgRApPo/view]
> TPC-H Query 18
> {code:java}
> select C_NAME,C_CUSTKEY,O_ORDERKEY,O_ORDERDATE,O_TOTALPRICE,sum(L_QUANTITY) 
> from CUSTOMER,ORDERS,LINEITEM where O_ORDERKEY in ( select L_ORDERKEY from 
> LINEITEM group by L_ORDERKEY having sum(L_QUANTITY) > 300 ) and C_CUSTKEY = 
> O_CUSTKEY and O_ORDERKEY = L_ORDERKEY group by 
> C_NAME,C_CUSTKEY,O_ORDERKEY,O_ORDERDATE,O_TOTALPRICE order by O_TOTALPRICE 
> desc,O_ORDERDATE
> {code}






[jira] [Commented] (SPARK-22505) toDF() / createDataFrame() type inference doesn't work as expected

2017-11-22 Thread Ruslan Dautkhanov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22505?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16263733#comment-16263733
 ] 

Ruslan Dautkhanov commented on SPARK-22505:
---

that's great. thank you [~hyukjin.kwon]

> toDF() / createDataFrame() type inference doesn't work as expected
> --
>
> Key: SPARK-22505
> URL: https://issues.apache.org/jira/browse/SPARK-22505
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, Spark Core
>Affects Versions: 2.2.0, 2.3.0
>Reporter: Ruslan Dautkhanov
>  Labels: csvparser, inference, pyspark, schema, spark-sql
>
> {code}
> df = 
> sc.parallelize([('1','a'),('2','b'),('3','c')]).toDF(['should_be_int','should_be_str'])
> df.printSchema()
> {code}
> produces
> {noformat}
> root
>  |-- should_be_int: string (nullable = true)
>  |-- should_be_str: string (nullable = true)
> {noformat}
> Notice that `should_be_int` has a `string` datatype, even though according to the documentation:
> https://spark.apache.org/docs/latest/sql-programming-guide.html#inferring-the-schema-using-reflection
> {quote}
> Spark SQL can convert an RDD of Row objects to a DataFrame, inferring the 
> datatypes. Rows are constructed by passing a list of key/value pairs as 
> kwargs to the Row class. The keys of this list define the column names of the 
> table, *and the types are inferred by sampling the whole dataset*, similar to 
> the inference that is performed on JSON files.
> {quote}
> Schema inference works as expected when reading delimited files like
> {code}
> spark.read.format('csv').option('inferSchema', True)...
> {code}
> but not when using toDF() / createDataFrame() API calls.
> Spark 2.2.






[jira] [Reopened] (SPARK-22560) Must create spark session directly to connect to hive

2017-11-22 Thread Ran Mingxuan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22560?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ran Mingxuan reopened SPARK-22560:
--

My method is not working. Need support.

> Must create spark session directly to connect to hive
> -
>
> Key: SPARK-22560
> URL: https://issues.apache.org/jira/browse/SPARK-22560
> Project: Spark
>  Issue Type: Bug
>  Components: Java API, SQL
>Affects Versions: 2.1.0, 2.2.0
>Reporter: Ran Mingxuan
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> In a Java project I have to use both JavaSparkContext and SparkSession. I 
> found that the order in which they are created affects the Hive connection.
> I have built a Spark job like below:
> {code:java}
> // wrong code
> public void main(String[] args)
> {
> SparkConf sparkConf = new SparkConf().setAppName("testApp");
> JavaSparkContext sc = new JavaSparkContext(sparkConf);
> SparkSession spark = 
> SparkSession.builder().sparkContext(sc.sc()).enableHiveSupport().getOrCreate();
> spark.sql("show databases").show();
> }
> {code}
> With this code the Spark job is not able to find the Hive metastore even 
> though it discovers the correct warehouse.
> I have to use code like below to make things work:
> {code:java}
> // correct code 
> public String main(String[] args)
> {
> SparkConf sparkConf = new SparkConf().setAppName("testApp");
> SparkSession spark = 
> SparkSession.builder().config(sparkConf).enableHiveSupport().getOrCreate();
> SparkContext sparkContext = spark.sparkContext();
> JavaSparkContext sc = JavaSparkContext.fromSparkContext(sparkContext);
> spark.sql("show databases").show();
> }
> {code}






[jira] [Commented] (SPARK-22553) Drop FROM in nonReserved

2017-11-22 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22553?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16263712#comment-16263712
 ] 

Xiao Li commented on SPARK-22553:
-

We need a COMPATIBILITY config like 
https://www.ibm.com/support/knowledgecenter/en/SSEPGG_10.1.0/com.ibm.db2.luw.apdv.porting.doc/doc/r0052867.html

This is the right direction, but I personally do not have the bandwidth to 
drive it.

> Drop FROM in nonReserved
> 
>
> Key: SPARK-22553
> URL: https://issues.apache.org/jira/browse/SPARK-22553
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Takeshi Yamamuro
>Priority: Trivial
>
> The simple query below throws a misleading error because nonReserved has 
> `FROM` in SqlBase.g4:
> {code}
> scala> Seq((1, 2)).toDF("a", "b").createTempView("t")
> scala> sql("select a, count(1), from t group by 1").show
> org.apache.spark.sql.AnalysisException: cannot resolve '`a`' given input 
> columns: []; line 1 pos 7;
> 'Aggregate [unresolvedordinal(1)], ['a, count(1) AS count(1)#13L, 'from AS 
> t#11]
> +- OneRowRelation$
>   at 
> org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:88)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:85)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:289)
> {code}
> I know nonReserved currently has `FROM` because of the historical reason 
> (https://github.com/apache/spark/pull/18079#discussion_r118842186). But, 
> since IMHO this is a kind of common mistake (this message annoyed me a few 
> days ago in large SQL queries...), it might be worth dropping it from 
> nonReserved.
> FYI: PostgreSQL throws an explicit error in this case:
> {code}
> postgres=# select a, count(1), from test group by b;
> ERROR:  syntax error at or near "from" at character 21
> STATEMENT:  select a, count(1), from test group by b;
> ERROR:  syntax error at or near "from"
> LINE 1: select a, count(1), from test group by b;
> {code}






[jira] [Resolved] (SPARK-22552) Cannot Union multiple kafka streams

2017-11-22 Thread sachin malhotra (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22552?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sachin malhotra resolved SPARK-22552.
-
Resolution: Not A Problem

> Cannot Union multiple kafka streams
> ---
>
> Key: SPARK-22552
> URL: https://issues.apache.org/jira/browse/SPARK-22552
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.2.0
>Reporter: sachin malhotra
>Priority: Minor
>
> When unioning multiple Kafka streams, I found that the resulting DataFrame 
> only contains the data that exists in the DataFrame that initiated the union, 
> i.e. for df1.union(df2) (or a chain of unions) the result will only contain 
> the rows that exist in df1.
> To be more specific, this occurs when data arrives during the same 
> micro-batch for all three streams. If you wait for each single row to be 
> processed for each stream, the union does return the right results.
> For example, if you have 3 Kafka streams and you:
> send message 1 to stream 1, WAIT for the batch to finish, send message 2 to 
> stream 2, wait for the batch to finish, send message 3 to stream 3, wait for 
> the batch to finish, then the union returns the right data.
> But if you
> send messages 1, 2, 3 and WAIT for the batch to finish, you only receive the 
> data from the first stream when unioning all three DataFrames.






[jira] [Resolved] (SPARK-22551) Fix 64kb compile error for common expression types

2017-11-22 Thread Liang-Chi Hsieh (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22551?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Liang-Chi Hsieh resolved SPARK-22551.
-
Resolution: Not A Problem

> Fix 64kb compile error for common expression types
> --
>
> Key: SPARK-22551
> URL: https://issues.apache.org/jira/browse/SPARK-22551
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Liang-Chi Hsieh
>
> For common expression types, such as {{BinaryExpression}} and 
> {{TernaryExpression}}, the combined generated code of the children can 
> be large. We should put the code into functions to prevent a possible 
> 64KB compile error.






[jira] [Commented] (SPARK-22551) Fix 64kb compile error for common expression types

2017-11-22 Thread Liang-Chi Hsieh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22551?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16263696#comment-16263696
 ] 

Liang-Chi Hsieh commented on SPARK-22551:
-

After SPARK-22543 was merged, I can't reproduce this issue. Hopefully it also 
solves this one; I am closing this as Not A Problem for now.

> Fix 64kb compile error for common expression types
> --
>
> Key: SPARK-22551
> URL: https://issues.apache.org/jira/browse/SPARK-22551
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Liang-Chi Hsieh
>
> For common expression types, such as {{BinaryExpression}} and 
> {{TernaryExpression}}, the combined generated code of the children can 
> be large. We should put the code into functions to prevent a possible 
> 64KB compile error.






[jira] [Commented] (SPARK-22240) S3 CSV number of partitions incorrectly computed

2017-11-22 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16263690#comment-16263690
 ] 

Hyukjin Kwon commented on SPARK-22240:
--

Sure, sounds good !

> S3 CSV number of partitions incorrectly computed
> 
>
> Key: SPARK-22240
> URL: https://issues.apache.org/jira/browse/SPARK-22240
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
> Environment: Running on EMR 5.8.0 with Hadoop 2.7.3 and Spark 2.2.0
>Reporter: Arthur Baudry
>
> Reading CSV out of S3 using S3A protocol does not compute the number of 
> partitions correctly in Spark 2.2.0.
> With Spark 2.2.0 I get only one partition when loading a 14GB file
> {code:java}
> scala> val input = spark.read.format("csv").option("header", 
> "true").option("delimiter", "|").option("multiLine", 
> "true").load("s3a://")
> input: org.apache.spark.sql.DataFrame = [PARTY_KEY: string, ROW_START_DATE: 
> string ... 36 more fields]
> scala> input.rdd.getNumPartitions
> res2: Int = 1
> {code}
> While in Spark 2.0.2 I had:
> {code:java}
> scala> val input = spark.read.format("csv").option("header", 
> "true").option("delimiter", "|").option("multiLine", 
> "true").load("s3a://")
> input: org.apache.spark.sql.DataFrame = [PARTY_KEY: string, ROW_START_DATE: 
> string ... 36 more fields]
> scala> input.rdd.getNumPartitions
> res2: Int = 115
> {code}
> This introduces obvious performance issues in Spark 2.2.0. Maybe there is a 
> property that should be set to have the number of partitions computed 
> correctly.
> I'm aware that .option("multiLine", "true") is not supported in Spark 
> 2.0.2, but that's not relevant here.
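
A hedged workaround sketch (editorial addition, not from the ticket): if the file really does come in as a single partition, repartitioning right after the read restores parallelism for downstream stages, at the cost of a shuffle.

{code:java}
// Illustration only: redistribute the single input partition before heavy processing.
val repartitioned = input.repartition(115)
repartitioned.rdd.getNumPartitions   // 115 partitions after the shuffle
{code}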






[jira] [Commented] (SPARK-22165) Type conflicts between dates, timestamps and date in partition column

2017-11-22 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16263686#comment-16263686
 ] 

Hyukjin Kwon commented on SPARK-22165:
--

[~cloud_fan], BTW, should we maybe leave a release-notes tag too?

> Type conflicts between dates, timestamps and date in partition column
> -
>
> Key: SPARK-22165
> URL: https://issues.apache.org/jira/browse/SPARK-22165
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.1, 2.2.0, 2.3.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Minor
> Fix For: 2.3.0
>
>
> It looks like we have some bugs when resolving type conflicts in partition 
> columns. I found a few corner cases, listed below:
> Case 1: timestamp should be inferred but date type is inferred.
> {code}
> val df = Seq((1, "2015-01-01"), (2, "2016-01-01 00:00:00")).toDF("i", "ts")
> df.write.format("parquet").partitionBy("ts").save("/tmp/foo")
> spark.read.load("/tmp/foo").printSchema()
> {code}
> {code}
> root
>  |-- i: integer (nullable = true)
>  |-- ts: date (nullable = true)
> {code}
> Case 2: decimal should be inferred but integer is inferred.
> {code}
> val df = Seq((1, "1"), (2, "1" * 30)).toDF("i", "decimal")
> df.write.format("parquet").partitionBy("decimal").save("/tmp/bar")
> spark.read.load("/tmp/bar").printSchema()
> {code}
> {code}
> root
>  |-- i: integer (nullable = true)
>  |-- decimal: integer (nullable = true)
> {code}
> It looks like we should de-duplicate the type resolution logic if possible 
> rather than keep a separate numeric precedence-like comparison alone.






[jira] [Commented] (SPARK-22505) toDF() / createDataFrame() type inference doesn't work as expected

2017-11-22 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22505?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16263684#comment-16263684
 ] 

Hyukjin Kwon commented on SPARK-22505:
--

Ah, BTW, for the RDD case, Spark has {{def csv(csvDataset: Dataset[String])}} for 
Scala, which you can call like {{spark.read.csv(rdd.toDS)}} from 2.2.0, and 
PySpark has it from 2.3.0 - SPARK-22112.
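
A small sketch of that suggestion (editorial illustration; assumes Spark 2.2+ and an RDD of CSV-formatted strings):

{code:java}
// Route the RDD through the Dataset[String] csv reader so inferSchema applies.
import spark.implicits._
val rdd = sc.parallelize(Seq("1,a", "2,b", "3,c"))
val df = spark.read.option("inferSchema", "true").csv(rdd.toDS)
df.printSchema()   // first column inferred as integer, second as string
{code}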

> toDF() / createDataFrame() type inference doesn't work as expected
> --
>
> Key: SPARK-22505
> URL: https://issues.apache.org/jira/browse/SPARK-22505
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, Spark Core
>Affects Versions: 2.2.0, 2.3.0
>Reporter: Ruslan Dautkhanov
>  Labels: csvparser, inference, pyspark, schema, spark-sql
>
> {code}
> df = 
> sc.parallelize([('1','a'),('2','b'),('3','c')]).toDF(['should_be_int','should_be_str'])
> df.printSchema()
> {code}
> produces
> {noformat}
> root
>  |-- should_be_int: string (nullable = true)
>  |-- should_be_str: string (nullable = true)
> {noformat}
> Notice that `should_be_int` has a `string` datatype, even though according to the documentation:
> https://spark.apache.org/docs/latest/sql-programming-guide.html#inferring-the-schema-using-reflection
> {quote}
> Spark SQL can convert an RDD of Row objects to a DataFrame, inferring the 
> datatypes. Rows are constructed by passing a list of key/value pairs as 
> kwargs to the Row class. The keys of this list define the column names of the 
> table, *and the types are inferred by sampling the whole dataset*, similar to 
> the inference that is performed on JSON files.
> {quote}
> Schema inference works as expected when reading delimited files like
> {code}
> spark.read.format('csv').option('inferSchema', True)...
> {code}
> but not when using toDF() / createDataFrame() API calls.
> Spark 2.2.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21866) SPIP: Image support in Spark

2017-11-22 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21866?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16263589#comment-16263589
 ] 

Reynold Xin commented on SPARK-21866:
-

Why not just declare an image function that loads the image data source?
The function will throw an exception if one cannot be loaded.
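
A rough sketch of what such a convenience function could look like (the name and error handling here are made up for illustration; this is not the API proposed in the SPIP):
{code}
import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical helper: delegate to the generic data source API and fail fast
// with a clear message if the "image" source is not on the classpath.
def readImages(spark: SparkSession, path: String): DataFrame =
  try {
    spark.read.format("image").load(path)
  } catch {
    case e: ClassNotFoundException =>
      throw new IllegalStateException(s"image data source not available for $path", e)
  }
{code}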


> SPIP: Image support in Spark
> 
>
> Key: SPARK-21866
> URL: https://issues.apache.org/jira/browse/SPARK-21866
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: Timothy Hunter
>Assignee: Ilya Matiach
>  Labels: SPIP
> Fix For: 2.3.0
>
> Attachments: SPIP - Image support for Apache Spark V1.1.pdf
>
>
> h2. Background and motivation
> As Apache Spark is being used more and more in the industry, some new use 
> cases are emerging for different data formats beyond the traditional SQL 
> types or the numerical types (vectors and matrices). Deep Learning 
> applications commonly deal with image processing. A number of projects add 
> some Deep Learning capabilities to Spark (see list below), but they struggle 
> to  communicate with each other or with MLlib pipelines because there is no 
> standard way to represent an image in Spark DataFrames. We propose to 
> federate efforts for representing images in Spark by defining a 
> representation that caters to the most common needs of users and library 
> developers.
> This SPIP proposes a specification to represent images in Spark DataFrames 
> and Datasets (based on existing industrial standards), and an interface for 
> loading sources of images. It is not meant to be a full-fledged image 
> processing library, but rather the core description that other libraries and 
> users can rely on. Several packages already offer various processing 
> facilities for transforming images or doing more complex operations, and each 
> has various design tradeoffs that make them better as standalone solutions.
> This project is a joint collaboration between Microsoft and Databricks, which 
> have been testing this design in two open source packages: MMLSpark and Deep 
> Learning Pipelines.
> The proposed image format is an in-memory, decompressed representation that 
> targets low-level applications. It is significantly more liberal in memory 
> usage than compressed image representations such as JPEG, PNG, etc., but it 
> allows easy communication with popular image processing libraries and has no 
> decoding overhead.
> h2. Targets users and personas:
> Data scientists, data engineers, library developers.
> The following libraries define primitives for loading and representing 
> images, and will gain from a common interchange format (in alphabetical 
> order):
> * BigDL
> * DeepLearning4J
> * Deep Learning Pipelines
> * MMLSpark
> * TensorFlow (Spark connector)
> * TensorFlowOnSpark
> * TensorFrames
> * Thunder
> h2. Goals:
> * Simple representation of images in Spark DataFrames, based on pre-existing 
> industrial standards (OpenCV)
> * This format should eventually allow the development of high-performance 
> integration points with image processing libraries such as libOpenCV, Google 
> TensorFlow, CNTK, and other C libraries.
> * The reader should be able to read popular formats of images from 
> distributed sources.
> h2. Non-Goals:
> Images are a versatile medium and encompass a very wide range of formats and 
> representations. This SPIP explicitly aims at the most common use case in the 
> industry currently: multi-channel matrices of binary, int32, int64, float or 
> double data that can fit comfortably in the heap of the JVM:
> * the total size of an image should be restricted to less than 2GB (roughly)
> * the meaning of color channels is application-specific and is not mandated 
> by the standard (in line with the OpenCV standard)
> * specialized formats used in meteorology, the medical field, etc. are not 
> supported
> * this format is specialized to images and does not attempt to solve the more 
> general problem of representing n-dimensional tensors in Spark
> h2. Proposed API changes
> We propose to add a new package in the package structure, under the MLlib 
> project:
> {{org.apache.spark.image}}
> h3. Data format
> We propose to add the following structure:
> imageSchema = StructType([
> * StructField("mode", StringType(), False),
> ** The exact representation of the data.
> ** The values are described in the following OpenCV convention. Basically, 
> the type has both "depth" and "number of channels" info: in particular, type 
> "CV_8UC3" means "3 channel unsigned bytes". BGRA format would be CV_8UC4 
> (value 32 in the table) with the channel order specified by convention.
> ** The exact channel ordering and meaning of 

[jira] [Commented] (SPARK-21866) SPIP: Image support in Spark

2017-11-22 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21866?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16263582#comment-16263582
 ] 

Joseph K. Bradley commented on SPARK-21866:
---

As far as I know, it shouldn't be a problem.  The new datasource can be in 
{{mllib}} since the datasource API permits custom datasources.  People will be 
able to write {{spark.read.format("image")}}, though they won't be able to 
write {{spark.read.image(...)}}.  E.g., 
https://github.com/databricks/spark-avro lives outside of {{sql}}.

> SPIP: Image support in Spark
> 
>
> Key: SPARK-21866
> URL: https://issues.apache.org/jira/browse/SPARK-21866
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: Timothy Hunter
>  Labels: SPIP
> Fix For: 2.3.0
>
> Attachments: SPIP - Image support for Apache Spark V1.1.pdf
>
>
> h2. Background and motivation
> As Apache Spark is being used more and more in the industry, some new use 
> cases are emerging for different data formats beyond the traditional SQL 
> types or the numerical types (vectors and matrices). Deep Learning 
> applications commonly deal with image processing. A number of projects add 
> some Deep Learning capabilities to Spark (see list below), but they struggle 
> to  communicate with each other or with MLlib pipelines because there is no 
> standard way to represent an image in Spark DataFrames. We propose to 
> federate efforts for representing images in Spark by defining a 
> representation that caters to the most common needs of users and library 
> developers.
> This SPIP proposes a specification to represent images in Spark DataFrames 
> and Datasets (based on existing industrial standards), and an interface for 
> loading sources of images. It is not meant to be a full-fledged image 
> processing library, but rather the core description that other libraries and 
> users can rely on. Several packages already offer various processing 
> facilities for transforming images or doing more complex operations, and each 
> has various design tradeoffs that make them better as standalone solutions.
> This project is a joint collaboration between Microsoft and Databricks, which 
> have been testing this design in two open source packages: MMLSpark and Deep 
> Learning Pipelines.
> The proposed image format is an in-memory, decompressed representation that 
> targets low-level applications. It is significantly more liberal in memory 
> usage than compressed image representations such as JPEG, PNG, etc., but it 
> allows easy communication with popular image processing libraries and has no 
> decoding overhead.
> h2. Targets users and personas:
> Data scientists, data engineers, library developers.
> The following libraries define primitives for loading and representing 
> images, and will gain from a common interchange format (in alphabetical 
> order):
> * BigDL
> * DeepLearning4J
> * Deep Learning Pipelines
> * MMLSpark
> * TensorFlow (Spark connector)
> * TensorFlowOnSpark
> * TensorFrames
> * Thunder
> h2. Goals:
> * Simple representation of images in Spark DataFrames, based on pre-existing 
> industrial standards (OpenCV)
> * This format should eventually allow the development of high-performance 
> integration points with image processing libraries such as libOpenCV, Google 
> TensorFlow, CNTK, and other C libraries.
> * The reader should be able to read popular formats of images from 
> distributed sources.
> h2. Non-Goals:
> Images are a versatile medium and encompass a very wide range of formats and 
> representations. This SPIP explicitly aims at the most common use case in the 
> industry currently: multi-channel matrices of binary, int32, int64, float or 
> double data that can fit comfortably in the heap of the JVM:
> * the total size of an image should be restricted to less than 2GB (roughly)
> * the meaning of color channels is application-specific and is not mandated 
> by the standard (in line with the OpenCV standard)
> * specialized formats used in meteorology, the medical field, etc. are not 
> supported
> * this format is specialized to images and does not attempt to solve the more 
> general problem of representing n-dimensional tensors in Spark
> h2. Proposed API changes
> We propose to add a new package in the package structure, under the MLlib 
> project:
> {{org.apache.spark.image}}
> h3. Data format
> We propose to add the following structure:
> imageSchema = StructType([
> * StructField("mode", StringType(), False),
> ** The exact representation of the data.
> ** The values are described in the following OpenCV convention. Basically, 
> the type has both "depth" and "number of channels" info: in particular, type 
> "CV_8UC3" means "3 channel unsigned bytes". BGRA format would be CV_8UC4 
> (value 32 in 

[jira] [Assigned] (SPARK-21866) SPIP: Image support in Spark

2017-11-22 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21866?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley reassigned SPARK-21866:
-

Assignee: Ilya Matiach

> SPIP: Image support in Spark
> 
>
> Key: SPARK-21866
> URL: https://issues.apache.org/jira/browse/SPARK-21866
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: Timothy Hunter
>Assignee: Ilya Matiach
>  Labels: SPIP
> Fix For: 2.3.0
>
> Attachments: SPIP - Image support for Apache Spark V1.1.pdf
>
>
> h2. Background and motivation
> As Apache Spark is being used more and more in the industry, some new use 
> cases are emerging for different data formats beyond the traditional SQL 
> types or the numerical types (vectors and matrices). Deep Learning 
> applications commonly deal with image processing. A number of projects add 
> some Deep Learning capabilities to Spark (see list below), but they struggle 
> to  communicate with each other or with MLlib pipelines because there is no 
> standard way to represent an image in Spark DataFrames. We propose to 
> federate efforts for representing images in Spark by defining a 
> representation that caters to the most common needs of users and library 
> developers.
> This SPIP proposes a specification to represent images in Spark DataFrames 
> and Datasets (based on existing industrial standards), and an interface for 
> loading sources of images. It is not meant to be a full-fledged image 
> processing library, but rather the core description that other libraries and 
> users can rely on. Several packages already offer various processing 
> facilities for transforming images or doing more complex operations, and each 
> has various design tradeoffs that make them better as standalone solutions.
> This project is a joint collaboration between Microsoft and Databricks, which 
> have been testing this design in two open source packages: MMLSpark and Deep 
> Learning Pipelines.
> The proposed image format is an in-memory, decompressed representation that 
> targets low-level applications. It is significantly more liberal in memory 
> usage than compressed image representations such as JPEG, PNG, etc., but it 
> allows easy communication with popular image processing libraries and has no 
> decoding overhead.
> h2. Targets users and personas:
> Data scientists, data engineers, library developers.
> The following libraries define primitives for loading and representing 
> images, and will gain from a common interchange format (in alphabetical 
> order):
> * BigDL
> * DeepLearning4J
> * Deep Learning Pipelines
> * MMLSpark
> * TensorFlow (Spark connector)
> * TensorFlowOnSpark
> * TensorFrames
> * Thunder
> h2. Goals:
> * Simple representation of images in Spark DataFrames, based on pre-existing 
> industrial standards (OpenCV)
> * This format should eventually allow the development of high-performance 
> integration points with image processing libraries such as libOpenCV, Google 
> TensorFlow, CNTK, and other C libraries.
> * The reader should be able to read popular formats of images from 
> distributed sources.
> h2. Non-Goals:
> Images are a versatile medium and encompass a very wide range of formats and 
> representations. This SPIP explicitly aims at the most common use case in the 
> industry currently: multi-channel matrices of binary, int32, int64, float or 
> double data that can fit comfortably in the heap of the JVM:
> * the total size of an image should be restricted to less than 2GB (roughly)
> * the meaning of color channels is application-specific and is not mandated 
> by the standard (in line with the OpenCV standard)
> * specialized formats used in meteorology, the medical field, etc. are not 
> supported
> * this format is specialized to images and does not attempt to solve the more 
> general problem of representing n-dimensional tensors in Spark
> h2. Proposed API changes
> We propose to add a new package in the package structure, under the MLlib 
> project:
> {{org.apache.spark.image}}
> h3. Data format
> We propose to add the following structure:
> imageSchema = StructType([
> * StructField("mode", StringType(), False),
> ** The exact representation of the data.
> ** The values are described in the following OpenCV convention. Basically, 
> the type has both "depth" and "number of channels" info: in particular, type 
> "CV_8UC3" means "3 channel unsigned bytes". BGRA format would be CV_8UC4 
> (value 32 in the table) with the channel order specified by convention.
> ** The exact channel ordering and meaning of each channel is dictated by 
> convention. By default, the order is RGB (3 channels) and BGRA (4 channels).
> If the image failed to load, the value is the empty string "".
> * StructField("origin", 

[jira] [Resolved] (SPARK-21866) SPIP: Image support in Spark

2017-11-22 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21866?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley resolved SPARK-21866.
---
   Resolution: Fixed
Fix Version/s: 2.3.0

Issue resolved by pull request 19439
[https://github.com/apache/spark/pull/19439]

> SPIP: Image support in Spark
> 
>
> Key: SPARK-21866
> URL: https://issues.apache.org/jira/browse/SPARK-21866
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: Timothy Hunter
>  Labels: SPIP
> Fix For: 2.3.0
>
> Attachments: SPIP - Image support for Apache Spark V1.1.pdf
>
>
> h2. Background and motivation
> As Apache Spark is being used more and more in the industry, some new use 
> cases are emerging for different data formats beyond the traditional SQL 
> types or the numerical types (vectors and matrices). Deep Learning 
> applications commonly deal with image processing. A number of projects add 
> some Deep Learning capabilities to Spark (see list below), but they struggle 
> to  communicate with each other or with MLlib pipelines because there is no 
> standard way to represent an image in Spark DataFrames. We propose to 
> federate efforts for representing images in Spark by defining a 
> representation that caters to the most common needs of users and library 
> developers.
> This SPIP proposes a specification to represent images in Spark DataFrames 
> and Datasets (based on existing industrial standards), and an interface for 
> loading sources of images. It is not meant to be a full-fledged image 
> processing library, but rather the core description that other libraries and 
> users can rely on. Several packages already offer various processing 
> facilities for transforming images or doing more complex operations, and each 
> has various design tradeoffs that make them better as standalone solutions.
> This project is a joint collaboration between Microsoft and Databricks, which 
> have been testing this design in two open source packages: MMLSpark and Deep 
> Learning Pipelines.
> The proposed image format is an in-memory, decompressed representation that 
> targets low-level applications. It is significantly more liberal in memory 
> usage than compressed image representations such as JPEG, PNG, etc., but it 
> allows easy communication with popular image processing libraries and has no 
> decoding overhead.
> h2. Targets users and personas:
> Data scientists, data engineers, library developers.
> The following libraries define primitives for loading and representing 
> images, and will gain from a common interchange format (in alphabetical 
> order):
> * BigDL
> * DeepLearning4J
> * Deep Learning Pipelines
> * MMLSpark
> * TensorFlow (Spark connector)
> * TensorFlowOnSpark
> * TensorFrames
> * Thunder
> h2. Goals:
> * Simple representation of images in Spark DataFrames, based on pre-existing 
> industrial standards (OpenCV)
> * This format should eventually allow the development of high-performance 
> integration points with image processing libraries such as libOpenCV, Google 
> TensorFlow, CNTK, and other C libraries.
> * The reader should be able to read popular formats of images from 
> distributed sources.
> h2. Non-Goals:
> Images are a versatile medium and encompass a very wide range of formats and 
> representations. This SPIP explicitly aims at the most common use case in the 
> industry currently: multi-channel matrices of binary, int32, int64, float or 
> double data that can fit comfortably in the heap of the JVM:
> * the total size of an image should be restricted to less than 2GB (roughly)
> * the meaning of color channels is application-specific and is not mandated 
> by the standard (in line with the OpenCV standard)
> * specialized formats used in meteorology, the medical field, etc. are not 
> supported
> * this format is specialized to images and does not attempt to solve the more 
> general problem of representing n-dimensional tensors in Spark
> h2. Proposed API changes
> We propose to add a new package in the package structure, under the MLlib 
> project:
> {{org.apache.spark.image}}
> h3. Data format
> We propose to add the following structure:
> imageSchema = StructType([
> * StructField("mode", StringType(), False),
> ** The exact representation of the data.
> ** The values are described in the following OpenCV convention. Basically, 
> the type has both "depth" and "number of channels" info: in particular, type 
> "CV_8UC3" means "3 channel unsigned bytes". BGRA format would be CV_8UC4 
> (value 32 in the table) with the channel order specified by convention.
> ** The exact channel ordering and meaning of each channel is dictated by 
> convention. By default, the order is RGB (3 channels) and BGRA (4 channels).
> If the image failed to load, the 

[jira] [Commented] (SPARK-22267) Spark SQL incorrectly reads ORC file when column order is different

2017-11-22 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22267?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16263548#comment-16263548
 ] 

Dongjoon Hyun commented on SPARK-22267:
---

[~mpetruska], if you have a patch, please feel free to proceed with the 
`convertMetastoreOrc=false` case. Thanks.
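
For anyone reproducing this, a small sketch of toggling the config in question per session (it reuses the table {{t}} from the test case below; the literal key name is my assumption based on HiveUtils.CONVERT_METASTORE_ORC):
{code}
// Assumption: the SQL conf key behind HiveUtils.CONVERT_METASTORE_ORC is
// "spark.sql.hive.convertMetastoreOrc". Toggling it switches between Spark's
// native ORC reader and the Hive SerDe path for the metastore table.
spark.conf.set("spark.sql.hive.convertMetastoreOrc", "false")
spark.table("t").show()

spark.conf.set("spark.sql.hive.convertMetastoreOrc", "true")
spark.table("t").show()
{code}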

> Spark SQL incorrectly reads ORC file when column order is different
> ---
>
> Key: SPARK-22267
> URL: https://issues.apache.org/jira/browse/SPARK-22267
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.3, 2.0.2, 2.1.0, 2.2.0
>Reporter: Dongjoon Hyun
>
> For a long time, Apache Spark SQL returns incorrect results when ORC file 
> schema is different from metastore schema order.
> {code}
> scala> Seq(1 -> 2).toDF("c1", 
> "c2").write.format("parquet").mode("overwrite").save("/tmp/p")
> scala> Seq(1 -> 2).toDF("c1", 
> "c2").write.format("orc").mode("overwrite").save("/tmp/o")
> scala> sql("CREATE EXTERNAL TABLE p(c2 INT, c1 INT) STORED AS parquet 
> LOCATION '/tmp/p'")
> scala> sql("CREATE EXTERNAL TABLE o(c2 INT, c1 INT) STORED AS orc LOCATION 
> '/tmp/o'")
> scala> spark.table("p").show  // Parquet is good.
> +---+---+
> | c2| c1|
> +---+---+
> |  2|  1|
> +---+---+
> scala> spark.table("o").show// This is wrong.
> +---+---+
> | c2| c1|
> +---+---+
> |  1|  2|
> +---+---+
> scala> spark.read.orc("/tmp/o").show  // This is correct.
> +---+---+
> | c1| c2|
> +---+---+
> |  1|  2|
> +---+---+
> {code}
> *TESTCASE*
> {code}
>   test("SPARK-22267 Spark SQL incorrectly reads ORC files when column order 
> is different") {
> withTempDir { dir =>
>   val path = dir.getCanonicalPath
>   Seq(1 -> 2).toDF("c1", 
> "c2").write.format("orc").mode("overwrite").save(path)
>   checkAnswer(spark.read.orc(path), Row(1, 2))
>   Seq("true", "false").foreach { value =>
> withTable("t") {
>   withSQLConf(HiveUtils.CONVERT_METASTORE_ORC.key -> value) {
> sql(s"CREATE EXTERNAL TABLE t(c2 INT, c1 INT) STORED AS ORC 
> LOCATION '$path'")
> checkAnswer(spark.table("t"), Row(2, 1))
>   }
> }
>   }
> }
>   }
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22585) Url encoding of jar path expected?

2017-11-22 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16263512#comment-16263512
 ] 

Sean Owen commented on SPARK-22585:
---

Does the real file name contain "%3A443" or ":443"? 

> Url encoding of jar path expected?
> --
>
> Key: SPARK-22585
> URL: https://issues.apache.org/jira/browse/SPARK-22585
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Jakub Dubovsky
>
> I am calling the {code}sparkContext.addJar{code} method with a path to a local jar 
> I want to add. Example:
> {code}/home/me/.coursier/cache/v1/https/artifactory.com%3A443/path/to.jar{code}.
>  As a result I get an exception saying
> {code}
> Failed to add 
> /home/me/.coursier/cache/v1/https/artifactory.com%3A443/path/to.jar to Spark 
> environment. Stacktrace:
> java.io.FileNotFoundException: Jar 
> /home/me/.coursier/cache/v1/https/artifactory.com:443/path/to.jar not found
> {code}
> The important part to notice here is that the colon character is URL-encoded in the path 
> I want to use, but the exception complains about the path in its decoded form. This 
> is caused by this line of code in the implementation ([see 
> here|https://github.com/apache/spark/blob/v2.2.0/core/src/main/scala/org/apache/spark/SparkContext.scala#L1833]):
> {code}
> case null | "file" => addJarFile(new File(uri.getPath))
> {code}
> It uses the 
> [getPath|https://docs.oracle.com/javase/7/docs/api/java/net/URI.html#getPath()]
>  method of 
> [java.net.URI|https://docs.oracle.com/javase/7/docs/api/java/net/URI.html], 
> which URL-decodes the path. I believe the 
> [getRawPath|https://docs.oracle.com/javase/7/docs/api/java/net/URI.html#getRawPath()]
>  method should be used here instead, since it keeps the path string in its original form.
> I tend to see this as a bug, since I want to use my dependencies resolved from 
> artifactory (with the port in the path) directly. Is there some specific reason for this, or can 
> we fix this?
> Thanks



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22586) Feature selection

2017-11-22 Thread Jorge Gonzalez Lopez (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22586?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16263498#comment-16263498
 ] 

Jorge Gonzalez Lopez commented on SPARK-22586:
--

Sorry for posting in the wrong section. I'll forward the question to the 
mailing list.

> Feature selection 
> --
>
> Key: SPARK-22586
> URL: https://issues.apache.org/jira/browse/SPARK-22586
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: Jorge Gonzalez Lopez
>Priority: Minor
>
> Hello everyone, 
> I would like to know if there are plans to add different score functions to 
> perform feature selection under the same interface. I saw two previous issues 
> related to the topic:
> https://issues.apache.org/jira/browse/SPARK-6531
> https://issues.apache.org/jira/browse/SPARK-1473
> However, it seems nothing was added at the end. I would like to know if there 
> was some problem then, because I wouldn't mind taking a closer look at it in 
> case people would be interested. 
> Additionally, I think it would be interesting to include a score metric 
> between continuous attributes (for regression), and between continuous and 
> discrete (for classification). This has already been done successfully on 
> http://scikit-learn.org/stable/modules/feature_selection.html#univariate-feature-selection



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-22584) dataframe write partitionBy out of disk/java heap issues

2017-11-22 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-22584.
---
Resolution: Not A Problem

This doesn't sound like a bug. Running out of memory is 'normal' in that you'll 
often need to tune and pay attention to data distribution.
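
Not from this thread, but one commonly used mitigation for this write pattern is to cluster the data by the partition columns before calling partitionBy, so each task keeps only one Parquet writer open at a time; a sketch reusing the df/location names from the report quoted below:
{code}
import org.apache.spark.sql.SaveMode
import org.apache.spark.sql.functions.col

// Clustering rows by the partition columns first means each task writes to a
// single (first, second) directory at a time, bounding the number of Parquet
// writers (and their column buffers) open concurrently per executor.
df.repartition(col("first"), col("second"))
  .write.partitionBy("first", "second")
  .mode(SaveMode.Overwrite)
  .parquet(s"$location$example/$corrId/")
{code}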

> dataframe write partitionBy out of disk/java heap issues
> 
>
> Key: SPARK-22584
> URL: https://issues.apache.org/jira/browse/SPARK-22584
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Derek M Miller
>
> I have been seeing some issues with partitionBy for the dataframe writer. I 
> currently have a file that is 6mb, just for testing, and it has around 1487 
> rows and 21 columns. There is nothing out of the ordinary with the columns, 
> having either a DoubleType or StringType. The partitionBy calls two different 
> partitions with verified low cardinality. One partition has 30 unique values 
> and the other one has 2 unique values.
> ```scala
> df
> .write.partitionBy("first", "second")
> .mode(SaveMode.Overwrite)
> .parquet(s"$location$example/$corrId/")
> ```
> When running this example on Amazon's EMR with 5 r4.xlarges (30 gb of memory 
> each), I am getting a java heap out of memory error. I have 
> maximizeResourceAllocation set, and verified on the instances. I have even 
> set it to false, explicitly set the driver and executor memory to 16g, but 
> still had the same issue. Occasionally I get an error about disk space, and 
> the job seems to work if I use an r3.xlarge (that has the ssd). But that 
> seems weird that 6mb of data needs to spill to disk.
> The problem mainly seems to be centered around two + partitions vs 1. If I 
> just use either of the partitions only, I have no problems. It's also worth 
> noting that each of the partitions are evenly distributed.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-22586) Feature selection

2017-11-22 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22586?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-22586.
---
Resolution: Invalid

This should be a question for the mailing list.

> Feature selection 
> --
>
> Key: SPARK-22586
> URL: https://issues.apache.org/jira/browse/SPARK-22586
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: Jorge Gonzalez Lopez
>Priority: Minor
>
> Hello everyone, 
> I would like to know if there are plans to add different score functions to 
> perform feature selection under the same interface. I saw two previous issues 
> related to the topic:
> https://issues.apache.org/jira/browse/SPARK-6531
> https://issues.apache.org/jira/browse/SPARK-1473
> However, it seems nothing was added at the end. I would like to know if there 
> was some problem then, because I wouldn't mind taking a closer look at it in 
> case people would be interested. 
> Additionally, I think it would be interesting to include a score metric 
> between continuous attributes (for regression), and between continuous and 
> discrete (for classification). This has already been done successfully on 
> http://scikit-learn.org/stable/modules/feature_selection.html#univariate-feature-selection



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22586) Feature selection

2017-11-22 Thread Jorge Gonzalez Lopez (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22586?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jorge Gonzalez Lopez updated SPARK-22586:
-
Environment: (was: Hello everyone, 

I would like to know if there are plans to add different score functions to 
perform feature selection under the same interface. I saw two previous issues 
related to the topic:

https://issues.apache.org/jira/browse/SPARK-6531
https://issues.apache.org/jira/browse/SPARK-1473

However, it seems nothing was added at the end. I would like to know if there 
was some problem then, because I wouldn't mind taking a closer look at it in 
case people would be interested. 

Additionally, I think it would be interesting to include a score metric between 
continuous attributes (for regression), and between continuous and discrete 
(for classification). This has already been done successfully on 
http://scikit-learn.org/stable/modules/feature_selection.html#univariate-feature-selection)

> Feature selection 
> --
>
> Key: SPARK-22586
> URL: https://issues.apache.org/jira/browse/SPARK-22586
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: Jorge Gonzalez Lopez
>Priority: Minor
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-22586) Feature selection

2017-11-22 Thread Jorge Gonzalez Lopez (JIRA)
Jorge Gonzalez Lopez created SPARK-22586:


 Summary: Feature selection 
 Key: SPARK-22586
 URL: https://issues.apache.org/jira/browse/SPARK-22586
 Project: Spark
  Issue Type: Improvement
  Components: ML
Affects Versions: 2.2.0
 Environment: Hello everyone, 

I would like to know if there are plans to add different score functions to 
perform feature selection under the same interface. I saw two previous issues 
related to the topic:

https://issues.apache.org/jira/browse/SPARK-6531
https://issues.apache.org/jira/browse/SPARK-1473

However, it seems nothing was added at the end. I would like to know if there 
was some problem then, because I wouldn't mind taking a closer look at it in 
case people would be interested. 

Additionally, I think it would be interesting to include a score metric between 
continuous attributes (for regression), and between continuous and discrete 
(for classification). This has already been done successfully on 
http://scikit-learn.org/stable/modules/feature_selection.html#univariate-feature-selection
Reporter: Jorge Gonzalez Lopez
Priority: Minor






--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22586) Feature selection

2017-11-22 Thread Jorge Gonzalez Lopez (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22586?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jorge Gonzalez Lopez updated SPARK-22586:
-
Description: 
Hello everyone, 

I would like to know if there are plans to add different score functions to 
perform feature selection under the same interface. I saw two previous issues 
related to the topic:

https://issues.apache.org/jira/browse/SPARK-6531
https://issues.apache.org/jira/browse/SPARK-1473

However, it seems nothing was added at the end. I would like to know if there 
was some problem then, because I wouldn't mind taking a closer look at it in 
case people would be interested. 

Additionally, I think it would be interesting to include a score metric between 
continuous attributes (for regression), and between continuous and discrete 
(for classification). This has already been done successfully on 
http://scikit-learn.org/stable/modules/feature_selection.html#univariate-feature-selection

> Feature selection 
> --
>
> Key: SPARK-22586
> URL: https://issues.apache.org/jira/browse/SPARK-22586
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: Jorge Gonzalez Lopez
>Priority: Minor
>
> Hello everyone, 
> I would like to know if there are plans to add different score functions to 
> perform feature selection under the same interface. I saw two previous issues 
> related to the topic:
> https://issues.apache.org/jira/browse/SPARK-6531
> https://issues.apache.org/jira/browse/SPARK-1473
> However, it seems nothing was added at the end. I would like to know if there 
> was some problem then, because I wouldn't mind taking a closer look at it in 
> case people would be interested. 
> Additionally, I think it would be interesting to include a score metric 
> between continuous attributes (for regression), and between continuous and 
> discrete (for classification). This has already been done successfully on 
> http://scikit-learn.org/stable/modules/feature_selection.html#univariate-feature-selection



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-22585) Url encoding of jar path expected?

2017-11-22 Thread Jakub Dubovsky (JIRA)
Jakub Dubovsky created SPARK-22585:
--

 Summary: Url encoding of jar path expected?
 Key: SPARK-22585
 URL: https://issues.apache.org/jira/browse/SPARK-22585
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.2.0
Reporter: Jakub Dubovsky


I am calling the {code}sparkContext.addJar{code} method with a path to a local jar I 
want to add. Example:
{code}/home/me/.coursier/cache/v1/https/artifactory.com%3A443/path/to.jar{code}.
 As a result I get an exception saying
{code}
Failed to add 
/home/me/.coursier/cache/v1/https/artifactory.com%3A443/path/to.jar to Spark 
environment. Stacktrace:
java.io.FileNotFoundException: Jar 
/home/me/.coursier/cache/v1/https/artifactory.com:443/path/to.jar not found
{code}
The important part to notice here is that the colon character is URL-encoded in the path I 
want to use, but the exception complains about the path in its decoded form. This is 
caused by this line of code in the implementation ([see 
here|https://github.com/apache/spark/blob/v2.2.0/core/src/main/scala/org/apache/spark/SparkContext.scala#L1833]):
{code}
case null | "file" => addJarFile(new File(uri.getPath))
{code}
It uses the 
[getPath|https://docs.oracle.com/javase/7/docs/api/java/net/URI.html#getPath()] 
method of 
[java.net.URI|https://docs.oracle.com/javase/7/docs/api/java/net/URI.html], 
which URL-decodes the path. I believe the 
[getRawPath|https://docs.oracle.com/javase/7/docs/api/java/net/URI.html#getRawPath()]
 method should be used here instead, since it keeps the path string in its original form.

I tend to see this as a bug, since I want to use my dependencies resolved from 
artifactory (with the port in the path) directly. Is there some specific reason for this, or can 
we fix this?

Thanks
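
For reference, a quick REPL check of the getPath/getRawPath difference described above:
{code}
// The path is the one from the report above, split only for line length.
val uri = new java.net.URI(
  "file:///home/me/.coursier/cache/v1/https/artifactory.com%3A443/path/to.jar")

println(uri.getPath)     // .../artifactory.com:443/path/to.jar   (percent-decoding applied)
println(uri.getRawPath)  // .../artifactory.com%3A443/path/to.jar (original form preserved)
{code}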



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22584) dataframe write partitionBy out of disk/java heap issues

2017-11-22 Thread Derek M Miller (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Derek M Miller updated SPARK-22584:
---
Description: 
I have been seeing some issues with partitionBy for the dataframe writer. I 
currently have a file that is 6mb, just for testing, and it has around 1487 
rows and 21 columns. There is nothing out of the ordinary with the columns, 
having either a DoubleType or StringType. The partitionBy calls two different 
partitions with verified low cardinality. One partition has 30 unique values 
and the other one has 2 unique values.

```scala
df
.write.partitionBy("first", "second")
.mode(SaveMode.Overwrite)
.parquet(s"$location$example/$corrId/")
```

When running this example on Amazon's EMR with 5 r4.xlarges (30 gb of memory 
each), I am getting a java heap out of memory error. I have 
maximizeResourceAllocation set, and verified on the instances. I have even set 
it to false, explicitly set the driver and executor memory to 16g, but still 
had the same issue. Occasionally I get an error about disk space, and the job 
seems to work if I use an r3.xlarge (that has the ssd). But that seems weird 
that 6mb of data needs to spill to disk.

The problem mainly seems to be centered around two + partitions vs 1. If I just 
use either of the partitions only, I have no problems. It's also worth noting 
that each of the partitions are evenly distributed.

  was:
I have been seeing some issues with partitionBy for the dataframe writer. I 
currently have a file that is 6mb, just for testing, and it has around 1487 
rows and 21 columns. There is nothing out of the ordinary with the columns, 
having either a DoubleType or String The partitionBy calls two different 
partitions with verified low cardinality. One partition has 30 unique values 
and the other one has 2 unique values.

```scala
df
.write.partitionBy("first", "second")
.mode(SaveMode.Overwrite)
.parquet(s"$location$example/$corrId/")
```

When running this example on Amazon's EMR with 5 r4.xlarges (30 gb of memory), 
I am getting a java heap out of memory error. I have maximizeResourceAllocation 
set, and verified on the instances. I have even set it to false, explicitly set 
the driver and executor memory to 16g, but still had the same issue. 
Occasionally I get an error about disk space, and the job seems to work if I 
use an r3.xlarge (that has the ssd). But that seems weird that 6mb of data 
needs to spill to disk.

The problem mainly seems to be centered around two + partitions vs 1. If I just 
use either of the partitions only, I have no problems. It's also worth noting 
that each of the partitions are evenly distributed.


> dataframe write partitionBy out of disk/java heap issues
> 
>
> Key: SPARK-22584
> URL: https://issues.apache.org/jira/browse/SPARK-22584
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Derek M Miller
>
> I have been seeing some issues with partitionBy for the dataframe writer. I 
> currently have a file that is 6mb, just for testing, and it has around 1487 
> rows and 21 columns. There is nothing out of the ordinary with the columns, 
> having either a DoubleType or StringType. The partitionBy calls two different 
> partitions with verified low cardinality. One partition has 30 unique values 
> and the other one has 2 unique values.
> ```scala
> df
> .write.partitionBy("first", "second")
> .mode(SaveMode.Overwrite)
> .parquet(s"$location$example/$corrId/")
> ```
> When running this example on Amazon's EMR with 5 r4.xlarges (30 gb of memory 
> each), I am getting a java heap out of memory error. I have 
> maximizeResourceAllocation set, and verified on the instances. I have even 
> set it to false, explicitly set the driver and executor memory to 16g, but 
> still had the same issue. Occasionally I get an error about disk space, and 
> the job seems to work if I use an r3.xlarge (that has the ssd). But that 
> seems weird that 6mb of data needs to spill to disk.
> The problem mainly seems to be centered around two + partitions vs 1. If I 
> just use either of the partitions only, I have no problems. It's also worth 
> noting that each of the partitions are evenly distributed.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-22584) dataframe write partitionBy out of disk/java heap issues

2017-11-22 Thread Derek M Miller (JIRA)
Derek M Miller created SPARK-22584:
--

 Summary: dataframe write partitionBy out of disk/java heap issues
 Key: SPARK-22584
 URL: https://issues.apache.org/jira/browse/SPARK-22584
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.2.0
Reporter: Derek M Miller


I have been seeing some issues with partitionBy for the dataframe writer. I 
currently have a file that is 6mb, just for testing, and it has around 1487 
rows and 21 columns. There is nothing out of the ordinary with the columns, 
having either a DoubleType or StringType. The partitionBy calls two different 
partitions with verified low cardinality. One partition has 30 unique values 
and the other one has 2 unique values.

```scala
df
.write.partitionBy("first", "second")
.mode(SaveMode.Overwrite)
.parquet(s"$location$example/$corrId/")
```

When running this example on Amazon's EMR with 5 r4.xlarges (30 gb of memory), 
I am getting a java heap out of memory error. I have maximizeResourceAllocation 
set, and verified on the instances. I have even set it to false, explicitly set 
the driver and executor memory to 16g, but still had the same issue. 
Occasionally I get an error about disk space, and the job seems to work if I 
use an r3.xlarge (that has the ssd). But that seems weird that 6mb of data 
needs to spill to disk.

The problem mainly seems to be centered around two + partitions vs 1. If I just 
use either of the partitions only, I have no problems. It's also worth noting 
that each of the partitions are evenly distributed.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-22578) CSV with quoted line breaks not correctly parsed

2017-11-22 Thread Carlos Barahona (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16263192#comment-16263192
 ] 

Carlos Barahona edited comment on SPARK-22578 at 11/22/17 7:20 PM:
---

I didn't realize that multiLine was disabled by default. However, using the 
option example you provided I have an upstream problem. While I'm obviously 
able to read the file using my earlier example, using the 
{code:java}
spark.read.option("multiLine", true).csv("tmp.csv").first
{code}

example, the job crashes with a

{code:java}
Caused by: java.io.FileNotFoundException: File file:tmp.csv does not exist
{code}

I can still turn around and not include the option and am able to read the file 
and display the first item.

Apologies if I'm missing something, I'm rather new to Spark.


was (Author: crbarahona):
I didn't realize that multiLine was disabled by default. However, using the 
option example you provided I have an upstream problem. While I'm obviously 
able to read the file using my earlier example, using the 
{code:java}
spark.read.option("multiLine", true).csv("tmp.csv").first
{code}

example, the job crashes with a

{code:java}
Caused by: java.ioFileNotFoundException: File file:tmp.csv does not exist
{code}

I can still turn around and not include the option and am able to read the file 
and display the first item.

Apologies if I'm missing something, I'm rather new to Spark.

> CSV with quoted line breaks not correctly parsed
> 
>
> Key: SPARK-22578
> URL: https://issues.apache.org/jira/browse/SPARK-22578
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Carlos Barahona
>
> I believe the behavior addressed in SPARK-19610 still exists. Using spark 
> 2.2.0, when attempting to read in a CSV file containing a quoted newline, the 
> resulting dataset contains two separate items split along the quoted newline.
> Example text:
> {code:java}
> 4/28/2015 8:01,4/28/2015 8:19,0,100,1078,1,4/28/2015 8:19,email,"Hello
> World", 2,3,4,5
> {code}
> scala> val csvFile = spark.read.csv("file:///path")
> csvFile: org.apache.spark.sql.DataFrame = [_c0: string, _c1: string ... 7 
> more fields]
> scala> csvFile.first()
> res2: org.apache.spark.sql.Row = [4/28/2015 8:01,4/28/2015 
> 8:19,0,100,1078,1,4/28/2015 8:19,email,Hello]
> scala> csvFile.count()
> res3: Long = 2



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22578) CSV with quoted line breaks not correctly parsed

2017-11-22 Thread Carlos Barahona (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16263192#comment-16263192
 ] 

Carlos Barahona commented on SPARK-22578:
-

I didn't realize that multiLine was disabled by default. However, using the 
option example you provided I have an upstream problem. While I'm obviously 
able to read the file using my earlier example, using the 
{code:java}
spark.read.option("multiLine", true).csv("tmp.csv").first
{code}

example, the job crashes with a

{code:java}
Caused by: java.ioFileNotFoundException: File file:tmp.csv does not exist
{code}

I can still turn around and not include the option and am able to read the file 
and display the first item.

Apologies if I'm missing something, I'm rather new to Spark.
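
Purely as something to try while this is investigated (the root cause is not confirmed), a sketch using a fully qualified path together with the option; the path below is illustrative:
{code}
// Guess at a workaround, not a diagnosis: qualify the path fully so it resolves
// the same way wherever the multiLine (whole-file) read actually happens.
val df = spark.read
  .option("multiLine", "true")
  .csv("file:///absolute/path/to/tmp.csv")

df.first()
{code}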

> CSV with quoted line breaks not correctly parsed
> 
>
> Key: SPARK-22578
> URL: https://issues.apache.org/jira/browse/SPARK-22578
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Carlos Barahona
>
> I believe the behavior addressed in SPARK-19610 still exists. Using spark 
> 2.2.0, when attempting to read in a CSV file containing a quoted newline, the 
> resulting dataset contains two separate items split along the quoted newline.
> Example text:
> {code:java}
> 4/28/2015 8:01,4/28/2015 8:19,0,100,1078,1,4/28/2015 8:19,email,"Hello
> World", 2,3,4,5
> {code}
> scala> val csvFile = spark.read.csv("file:///path")
> csvFile: org.apache.spark.sql.DataFrame = [_c0: string, _c1: string ... 7 
> more fields]
> scala> csvFile.first()
> res2: org.apache.spark.sql.Row = [4/28/2015 8:01,4/28/2015 
> 8:19,0,100,1078,1,4/28/2015 8:19,email,Hello]
> scala> csvFile.count()
> res3: Long = 2



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22526) Spark hangs while reading binary files from S3

2017-11-22 Thread mohamed imran (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16263097#comment-16263097
 ] 

mohamed imran commented on SPARK-22526:
---

[~ste...@apache.org] Yes, that makes sense to me. Is there any workaround 
for this?

> Spark hangs while reading binary files from S3
> --
>
> Key: SPARK-22526
> URL: https://issues.apache.org/jira/browse/SPARK-22526
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: mohamed imran
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> Hi,
> I am using Spark 2.2.0 (a recent version) to read binary files from S3. I use 
> sc.binaryFiles to read the files.
> It works fine for the first hundred or so file reads, but later it hangs 
> indefinitely for 5 up to 40 minutes, like the Avro file read issue (which was fixed in 
> later releases).
> I tried setting fs.s3a.connection.maximum to some large values, but that 
> didn't help.
> Finally I ended up setting the Spark speculation parameter, which 
> again didn't help much. 
> One thing I observed is that it does not close the connection after 
> every read of binary files from S3.
> example :- sc.binaryFiles("s3a://test/test123.zip")
> Please look into this major issue!



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22395) Fix the behavior of timestamp values for Pandas to respect session timezone

2017-11-22 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22395?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-22395:

Labels: release-notes  (was: )

> Fix the behavior of timestamp values for Pandas to respect session timezone
> ---
>
> Key: SPARK-22395
> URL: https://issues.apache.org/jira/browse/SPARK-22395
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 2.3.0
>Reporter: Takuya Ueshin
>  Labels: release-notes
>
> When converting a Pandas DataFrame/Series from/to a Spark DataFrame using 
> {{toPandas()}} or pandas udfs, timestamp values respect the Python 
> system timezone instead of the session timezone.
> For example, let's say we use {{"America/Los_Angeles"}} as session timezone 
> and have a timestamp value {{"1970-01-01 00:00:01"}} in the timezone. Btw, 
> I'm in Japan so Python timezone would be {{"Asia/Tokyo"}}.
> The timestamp value from current {{toPandas()}} will be the following:
> {noformat}
> >>> spark.conf.set("spark.sql.session.timeZone", "America/Los_Angeles")
> >>> df = spark.createDataFrame([28801], "long").selectExpr("timestamp(value) 
> >>> as ts")
> >>> df.show()
> +---+
> | ts|
> +---+
> |1970-01-01 00:00:01|
> +---+
> >>> df.toPandas()
>ts
> 0 1970-01-01 17:00:01
> {noformat}
> As you can see, the value becomes {{"1970-01-01 17:00:01"}} because it 
> respects Python timezone.
> As we discussed in https://github.com/apache/spark/pull/18664, we consider 
> this behavior is a bug and the value should be {{"1970-01-01 00:00:01"}}.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22583) First delegation token renewal time is not 75% of renewal time in Mesos

2017-11-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22583:


Assignee: (was: Apache Spark)

> First delegation token renewal time is not 75% of renewal time in Mesos
> ---
>
> Key: SPARK-22583
> URL: https://issues.apache.org/jira/browse/SPARK-22583
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos
>Affects Versions: 2.3.0
>Reporter: Kalvin Chau
>Priority: Blocker
>
> The first renewal time of the delegation tokens is the exact renewal time. 
> This could lead to a situation where the tokens don't make it to executors in 
> time, and the executors will be working with expired tokens (and Exception 
> out).
> The subsequent renewal times are correctly set to 75% of total renewal time. 
> The initial renewal time just needs to be set to 75% of the renewal time.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22583) First delegation token renewal time is not 75% of renewal time in Mesos

2017-11-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22583:


Assignee: Apache Spark

> First delegation token renewal time is not 75% of renewal time in Mesos
> ---
>
> Key: SPARK-22583
> URL: https://issues.apache.org/jira/browse/SPARK-22583
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos
>Affects Versions: 2.3.0
>Reporter: Kalvin Chau
>Assignee: Apache Spark
>Priority: Blocker
>
> The first renewal time of the delegation tokens is the exact renewal time. 
> This could lead to a situation where the tokens don't make it to executors in 
> time, and the executors will be working with expired tokens (and Exception 
> out).
> The subsequent renewal times are correctly set to 75% of total renewal time. 
> The initial renewal time just needs to be set to 75% of the renewal time.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22583) First delegation token renewal time is not 75% of renewal time in Mesos

2017-11-22 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16263068#comment-16263068
 ] 

Apache Spark commented on SPARK-22583:
--

User 'kalvinnchau' has created a pull request for this issue:
https://github.com/apache/spark/pull/19798

> First delegation token renewal time is not 75% of renewal time in Mesos
> ---
>
> Key: SPARK-22583
> URL: https://issues.apache.org/jira/browse/SPARK-22583
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos
>Affects Versions: 2.3.0
>Reporter: Kalvin Chau
>Priority: Blocker
>
> The first renewal time of the delegation tokens is the exact renewal time. 
> This could lead to a situation where the tokens don't make it to executors in 
> time, and the executors will be working with expired tokens (and Exception 
> out).
> The subsequent renewal times are correctly set to 75% of total renewal time. 
> The initial renewal time just needs to be set to 75% of the renewal time.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-22583) First delegation token renewal time is not 75% of renewal time in Mesos

2017-11-22 Thread Kalvin Chau (JIRA)
Kalvin Chau created SPARK-22583:
---

 Summary: First delegation token renewal time is not 75% of renewal 
time in Mesos
 Key: SPARK-22583
 URL: https://issues.apache.org/jira/browse/SPARK-22583
 Project: Spark
  Issue Type: Bug
  Components: Mesos
Affects Versions: 2.3.0
Reporter: Kalvin Chau
Priority: Blocker


The first renewal time of the delegation tokens is the exact renewal time. This 
could lead to a situation where the tokens don't make it to executors in time, 
and the executors will be working with expired tokens (and Exception out).

The subsequent renewal times are correctly set to 75% of total renewal time. 
The initial renewal time just needs to be set to 75% of the renewal time.
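
A minimal sketch of the intended arithmetic (the names are made up; this is not the actual Mesos renewer code):
{code}
// Schedule the *first* renewal at 75% of the time remaining until expiry,
// matching what the subsequent renewals already do.
def initialRenewalDelayMs(expirationTimeMs: Long,
                          nowMs: Long = System.currentTimeMillis()): Long =
  ((expirationTimeMs - nowMs) * 0.75).toLong
{code}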



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-22543) fix java 64kb compile error for deeply nested expressions

2017-11-22 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22543?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-22543.
-
   Resolution: Fixed
Fix Version/s: 2.3.0

> fix java 64kb compile error for deeply nested expressions
> -
>
> Key: SPARK-22543
> URL: https://issues.apache.org/jira/browse/SPARK-22543
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
> Fix For: 2.3.0
>
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22526) Spark hangs while reading binary files from S3

2017-11-22 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16263021#comment-16263021
 ] 

Steve Loughran commented on SPARK-22526:


If the input stream doesn't get closed, there is probably going to be a GET 
kept open on every request. So I'd make sure that whatever you are running in 
your RDD closes the stream. As the docs of {{PortableDataStream.open()}} say, "The 
user of this method is responsible for closing the stream after usage."
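For illustration, a minimal sketch (the bucket and prefix are made up) of closing the 
stream opened from each PortableDataStream inside the RDD operation:

{code:scala}
// A minimal sketch: close the stream opened from each PortableDataStream so that the
// underlying S3 GET connection is released back to the pool.
val sizes = sc.binaryFiles("s3a://some-bucket/some-prefix").map { case (path, pds) =>
  val in = pds.open()
  try {
    var total = 0L
    val buf = new Array[Byte](8192)
    var n = in.read(buf)
    while (n != -1) { total += n; n = in.read(buf) }
    (path, total)
  } finally {
    in.close()  // without this, each file read can leave a connection in CLOSE_WAIT
  }
}
{code}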

> Spark hangs while reading binary files from S3
> --
>
> Key: SPARK-22526
> URL: https://issues.apache.org/jira/browse/SPARK-22526
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: mohamed imran
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> Hi,
> I am using Spark 2.2.0 (a recent version) to read binary files from S3. I use 
> sc.binaryFiles to read the files.
> It works fine for roughly the first 100 file reads, but later it hangs 
> indefinitely for 5 to 40 minutes, like the Avro file read issue (which was 
> fixed in later releases).
> I tried setting fs.s3a.connection.maximum to some larger values, but that 
> didn't help.
> Finally I ended up enabling the Spark speculation setting, which again 
> didn't help much.
> One thing which I observed is that it is not closing the connection after 
> every read of binary files from S3.
> Example: sc.binaryFiles("s3a://test/test123.zip")
> Please look into this major issue!



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-19878) Add hive configuration when initialize hive serde in InsertIntoHiveTable.scala

2017-11-22 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19878?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-19878.
-
   Resolution: Fixed
Fix Version/s: 2.3.0
   2.2.1

> Add hive configuration when initialize hive serde in InsertIntoHiveTable.scala
> --
>
> Key: SPARK-19878
> URL: https://issues.apache.org/jira/browse/SPARK-19878
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.5.0, 1.6.0, 2.0.0
> Environment: Centos 6.5: Hadoop 2.6.0, Spark 1.5.0, Hive 1.1.0
>Reporter: kavn qin
>Assignee: Vinod KC
>  Labels: patch
> Fix For: 2.2.1, 2.3.0
>
> Attachments: SPARK-19878.patch
>
>
> When the case class InsertIntoHiveTable initializes a serde, it explicitly passes 
> null for the Configuration in Spark 1.5.0:
> [https://github.com/apache/spark/blob/v1.5.0/sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/InsertIntoHiveTable.scala#L58]
> Likewise, in Spark 2.0.0, when the HiveWriterContainer initializes a serde, it also just 
> passes null for the Configuration:
> [https://github.com/apache/spark/blob/v2.0.0/sql/hive/src/main/scala/org/apache/spark/sql/hive/hiveWriterContainers.scala#L161]
> When we implement a hive serde, we want to use the hive configuration to get 
> some static and dynamic settings, but we cannot do it!
> So this patch adds the configuration when initializing the hive serde.
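A rough sketch of the idea (illustrative only; the identifiers below are assumptions and 
not the merged change): initialize the serde with a real Hadoop configuration instead of 
null, so SerDe implementations can read static and dynamic settings from it.

{code:scala}
// Illustrative sketch only, not the actual patch: `sparkSession` and `tableDesc` are
// assumed to be in scope. The point is to pass a live Hadoop configuration to the
// serde rather than null.
val hadoopConf = sparkSession.sessionState.newHadoopConf()
val serde = tableDesc.getDeserializerClass.newInstance()
serde.initialize(hadoopConf, tableDesc.getProperties)
{code}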



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-19580) Support for avro.schema.url while writing to hive table

2017-11-22 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-19580.
-
   Resolution: Fixed
Fix Version/s: 2.3.0
   2.2.1

> Support for avro.schema.url while writing to hive table
> ---
>
> Key: SPARK-19580
> URL: https://issues.apache.org/jira/browse/SPARK-19580
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.3, 2.1.0
>Reporter: Mateusz Boryn
>Assignee: Vinod KC
>Priority: Critical
> Fix For: 2.2.1, 2.3.0
>
>
> Support for writing to a Hive table which uses an Avro schema pointed to by 
> avro.schema.url is missing. 
> I have a Hive table with the Avro data format. The table is created with a query 
> like this:
> {code:sql}
> CREATE TABLE some_table
>   PARTITIONED BY (YEAR int, MONTH int, DAY int)
>   ROW FORMAT SERDE
> 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
>   STORED AS INPUTFORMAT
> 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
>   OUTPUTFORMAT
> 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
>   LOCATION 'hdfs:///user/some_user/some_table'
>   TBLPROPERTIES (
> 'avro.schema.url'='hdfs:///user/some_user/some_table.avsc'
>   )
> {code}
> Please notice that there is an `avro.schema.url` and not an `avro.schema.literal` 
> property, as we have to keep schemas in separate files for various reasons.
> Trying to write to such a table results in an NPE.
> I tried to find a workaround for this, but nothing helped. Tried:
> - setting df.write.option("avroSchema", avroSchema) with an explicit schema 
> as a string
> - changing TBLPROPERTIES to SERDEPROPERTIES
> - replacing the explicit detailed SERDE specification with STORED AS AVRO
> I found that this can be solved by adding a couple of lines in 
> `org.apache.spark.sql.hive.HiveShim` next to where 
> `AvroSerdeUtils.AvroTableProperties.SCHEMA_LITERAL` is referenced.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19878) Add hive configuration when initialize hive serde in InsertIntoHiveTable.scala

2017-11-22 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19878?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li reassigned SPARK-19878:
---

Assignee: Vinod KC

> Add hive configuration when initialize hive serde in InsertIntoHiveTable.scala
> --
>
> Key: SPARK-19878
> URL: https://issues.apache.org/jira/browse/SPARK-19878
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.5.0, 1.6.0, 2.0.0
> Environment: Centos 6.5: Hadoop 2.6.0, Spark 1.5.0, Hive 1.1.0
>Reporter: kavn qin
>Assignee: Vinod KC
>  Labels: patch
> Fix For: 2.2.1, 2.3.0
>
> Attachments: SPARK-19878.patch
>
>
> When the case class InsertIntoHiveTable initializes a serde, it explicitly passes 
> null for the Configuration in Spark 1.5.0:
> [https://github.com/apache/spark/blob/v1.5.0/sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/InsertIntoHiveTable.scala#L58]
> Likewise, in Spark 2.0.0, when the HiveWriterContainer initializes a serde, it also just 
> passes null for the Configuration:
> [https://github.com/apache/spark/blob/v2.0.0/sql/hive/src/main/scala/org/apache/spark/sql/hive/hiveWriterContainers.scala#L161]
> When we implement a hive serde, we want to use the hive configuration to get 
> some static and dynamic settings, but we cannot do it!
> So this patch adds the configuration when initializing the hive serde.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19580) Support for avro.schema.url while writing to hive table

2017-11-22 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li reassigned SPARK-19580:
---

Assignee: Vinod KC

> Support for avro.schema.url while writing to hive table
> ---
>
> Key: SPARK-19580
> URL: https://issues.apache.org/jira/browse/SPARK-19580
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.3, 2.1.0
>Reporter: Mateusz Boryn
>Assignee: Vinod KC
>Priority: Critical
>
> Support for writing to a Hive table which uses an Avro schema pointed to by 
> avro.schema.url is missing. 
> I have a Hive table with the Avro data format. The table is created with a query 
> like this:
> {code:sql}
> CREATE TABLE some_table
>   PARTITIONED BY (YEAR int, MONTH int, DAY int)
>   ROW FORMAT SERDE
> 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
>   STORED AS INPUTFORMAT
> 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
>   OUTPUTFORMAT
> 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
>   LOCATION 'hdfs:///user/some_user/some_table'
>   TBLPROPERTIES (
> 'avro.schema.url'='hdfs:///user/some_user/some_table.avsc'
>   )
> {code}
> Please notice that there is an `avro.schema.url` and not an `avro.schema.literal` 
> property, as we have to keep schemas in separate files for various reasons.
> Trying to write to such a table results in an NPE.
> I tried to find a workaround for this, but nothing helped. Tried:
> - setting df.write.option("avroSchema", avroSchema) with an explicit schema 
> as a string
> - changing TBLPROPERTIES to SERDEPROPERTIES
> - replacing the explicit detailed SERDE specification with STORED AS AVRO
> I found that this can be solved by adding a couple of lines in 
> `org.apache.spark.sql.hive.HiveShim` next to where 
> `AvroSerdeUtils.AvroTableProperties.SCHEMA_LITERAL` is referenced.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-17920) HiveWriterContainer passes null configuration to serde.initialize, causing NullPointerException in AvroSerde when using avro.schema.url

2017-11-22 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17920?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-17920.
-
   Resolution: Fixed
 Assignee: Vinod KC
Fix Version/s: 2.3.0
   2.2.1

> HiveWriterContainer passes null configuration to serde.initialize, causing 
> NullPointerException in AvroSerde when using avro.schema.url
> ---
>
> Key: SPARK-17920
> URL: https://issues.apache.org/jira/browse/SPARK-17920
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.2, 2.0.0
> Environment: AWS EMR 5.0.0: Spark 2.0.0, Hive 2.1.0
>Reporter: James Norvell
>Assignee: Vinod KC
>Priority: Minor
> Fix For: 2.2.1, 2.3.0
>
> Attachments: avro.avsc, avro_data
>
>
> When HiveWriterContainer initializes a serde, it explicitly passes null for the 
> Configuration:
> https://github.com/apache/spark/blob/v2.0.0/sql/hive/src/main/scala/org/apache/spark/sql/hive/hiveWriterContainers.scala#L161
> When attempting to write to a table stored as Avro with avro.schema.url set, 
> this causes a NullPointerException when it tries to get the FileSystem for 
> the URL:
> https://github.com/apache/hive/blob/release-2.1.0-rc3/serde/src/java/org/apache/hadoop/hive/serde2/avro/AvroSerdeUtils.java#L153
> Reproduction:
> {noformat}
> spark-sql> create external table avro_in (a string) stored as avro location 
> '/avro-in/' tblproperties ('avro.schema.url'='/avro-schema/avro.avsc');
> spark-sql> create external table avro_out (a string) stored as avro location 
> '/avro-out/' tblproperties ('avro.schema.url'='/avro-schema/avro.avsc');
> spark-sql> select * from avro_in;
> hello
> Time taken: 1.986 seconds, Fetched 1 row(s)
> spark-sql> insert overwrite table avro_out select * from avro_in;
> 16/10/13 19:34:47 WARN AvroSerDe: Encountered exception determining schema. 
> Returning signal schema to indicate problem
> java.lang.NullPointerException
>   at org.apache.hadoop.fs.FileSystem.getDefaultUri(FileSystem.java:182)
>   at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:174)
>   at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:359)
>   at 
> org.apache.hadoop.hive.serde2.avro.AvroSerdeUtils.getSchemaFromFS(AvroSerdeUtils.java:131)
>   at 
> org.apache.hadoop.hive.serde2.avro.AvroSerdeUtils.determineSchemaOrThrowException(AvroSerdeUtils.java:112)
>   at 
> org.apache.hadoop.hive.serde2.avro.AvroSerDe.determineSchemaOrReturnErrorSchema(AvroSerDe.java:167)
>   at 
> org.apache.hadoop.hive.serde2.avro.AvroSerDe.initialize(AvroSerDe.java:103)
>   at 
> org.apache.spark.sql.hive.SparkHiveWriterContainer.newSerializer(hiveWriterContainers.scala:161)
>   at 
> org.apache.spark.sql.hive.execution.InsertIntoHiveTable.sideEffectResult$lzycompute(InsertIntoHiveTable.scala:236)
>   at 
> org.apache.spark.sql.hive.execution.InsertIntoHiveTable.sideEffectResult(InsertIntoHiveTable.scala:142)
>   at 
> org.apache.spark.sql.hive.execution.InsertIntoHiveTable.doExecute(InsertIntoHiveTable.scala:313)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133)
>   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:114)
>   at 
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:86)
>   at 
> org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:86)
>   at org.apache.spark.sql.Dataset.<init>(Dataset.scala:186)
>   at org.apache.spark.sql.Dataset.<init>(Dataset.scala:167)
>   at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:65)
>   at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:582)
>   at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:682)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLDriver.run(SparkSQLDriver.scala:62)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.processCmd(SparkSQLCLIDriver.scala:331)
>   at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:376)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:247)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> 

[jira] [Assigned] (SPARK-22570) Create a lot of global variables to reuse an object in generated code

2017-11-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22570:


Assignee: Apache Spark

> Create a lot of global variables to reuse an object in generated code
> -
>
> Key: SPARK-22570
> URL: https://issues.apache.org/jira/browse/SPARK-22570
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Kazuaki Ishizaki
>Assignee: Apache Spark
>
> Generated code for several operations such as {{Cast}}, {{RegExpReplace}}, 
> and {{CreateArray}} has a lot of global variables in a class to reuse an 
> object. The number of them may hit the 64K constant pool entry limit.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22570) Create a lot of global variables to reuse an object in generated code

2017-11-22 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22570?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16262922#comment-16262922
 ] 

Apache Spark commented on SPARK-22570:
--

User 'kiszk' has created a pull request for this issue:
https://github.com/apache/spark/pull/19797

> Create a lot of global variables to reuse an object in generated code
> -
>
> Key: SPARK-22570
> URL: https://issues.apache.org/jira/browse/SPARK-22570
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Kazuaki Ishizaki
>
> Generated code for several operations such as {{Cast}}, {{RegExpReplace}}, 
> and {{CreateArray}} has a lot of global variables in a class to reuse an 
> object. The number of them may hit the 64K constant pool entry limit.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22570) Create a lot of global variables to reuse an object in generated code

2017-11-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22570:


Assignee: (was: Apache Spark)

> Create a lot of global variables to reuse an object in generated code
> -
>
> Key: SPARK-22570
> URL: https://issues.apache.org/jira/browse/SPARK-22570
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Kazuaki Ishizaki
>
> Generated code for several operations such as {{Cast}}, {{RegExpReplace}}, 
> and {{CreateArray}} has a lot of global variables in a class to reuse an 
> object. The number of them may hit the 64K constant pool entry limit.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22570) Create a lot of global variables to reuse an object in generated code

2017-11-22 Thread Kazuaki Ishizaki (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kazuaki Ishizaki updated SPARK-22570:
-
Description: Generated code for several operations such as {{Cast}}, 
{{RegExpReplace}}, and {{CreateArray}} has a lot of global variables in a class 
to reuse an object. The number of them may hit the 64K constant pool entry limit.

> Create a lot of global variables to reuse an object in generated code
> -
>
> Key: SPARK-22570
> URL: https://issues.apache.org/jira/browse/SPARK-22570
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Kazuaki Ishizaki
>
> Generated code for several operations such as {{Cast}}, {{RegExpReplace}}, 
> and {{CreateArray}} has a lot of global variables in a class to reuse an 
> object. The number of them may hit the 64K constant pool entry limit.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22570) Create a lot of global variables to reuse an object

2017-11-22 Thread Kazuaki Ishizaki (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kazuaki Ishizaki updated SPARK-22570:
-
Summary: Create a lot of global variables to reuse an object  (was: Cast 
may create a lot of UTF8String.IntWrapper or UTF8String.longWrapper instances)

> Create a lot of global variables to reuse an object
> ---
>
> Key: SPARK-22570
> URL: https://issues.apache.org/jira/browse/SPARK-22570
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Kazuaki Ishizaki
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22570) Create a lot of global variables to reuse an object in generated code

2017-11-22 Thread Kazuaki Ishizaki (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kazuaki Ishizaki updated SPARK-22570:
-
Summary: Create a lot of global variables to reuse an object in generated 
code  (was: Create a lot of global variables to reuse an object)

> Create a lot of global variables to reuse an object in generated code
> -
>
> Key: SPARK-22570
> URL: https://issues.apache.org/jira/browse/SPARK-22570
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Kazuaki Ishizaki
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-22582) Spark SQL round throws error with negative precision

2017-11-22 Thread Yuxin Cao (JIRA)
Yuxin Cao created SPARK-22582:
-

 Summary: Spark SQL round throws error with negative precision
 Key: SPARK-22582
 URL: https://issues.apache.org/jira/browse/SPARK-22582
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.0.0
Reporter: Yuxin Cao


select  round(100.1 , 1) as c3,
round(100.1 , -1) as c5 from orders;
Error: java.lang.IllegalArgumentException: Error: name expected at the position 
10 of 'decimal(4,-1)' but '-' is found. (state=,code=0)

The same query works fine in Spark 1.6.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22267) Spark SQL incorrectly reads ORC file when column order is different

2017-11-22 Thread Mark Petruska (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22267?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16262850#comment-16262850
 ] 

Mark Petruska commented on SPARK-22267:
---

Hi [~dongjoon], I see that both issues mentioned are in progress.
Do you need some help or should I work on using `OrcFileFormat` even when 
"convertMetastoreOrc" is not set?

> Spark SQL incorrectly reads ORC file when column order is different
> ---
>
> Key: SPARK-22267
> URL: https://issues.apache.org/jira/browse/SPARK-22267
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.3, 2.0.2, 2.1.0, 2.2.0
>Reporter: Dongjoon Hyun
>
> For a long time, Apache Spark SQL has returned incorrect results when the ORC file 
> schema is different from the metastore schema order.
> {code}
> scala> Seq(1 -> 2).toDF("c1", 
> "c2").write.format("parquet").mode("overwrite").save("/tmp/p")
> scala> Seq(1 -> 2).toDF("c1", 
> "c2").write.format("orc").mode("overwrite").save("/tmp/o")
> scala> sql("CREATE EXTERNAL TABLE p(c2 INT, c1 INT) STORED AS parquet 
> LOCATION '/tmp/p'")
> scala> sql("CREATE EXTERNAL TABLE o(c2 INT, c1 INT) STORED AS orc LOCATION 
> '/tmp/o'")
> scala> spark.table("p").show  // Parquet is good.
> +---+---+
> | c2| c1|
> +---+---+
> |  2|  1|
> +---+---+
> scala> spark.table("o").show// This is wrong.
> +---+---+
> | c2| c1|
> +---+---+
> |  1|  2|
> +---+---+
> scala> spark.read.orc("/tmp/o").show  // This is correct.
> +---+---+
> | c1| c2|
> +---+---+
> |  1|  2|
> +---+---+
> {code}
> *TESTCASE*
> {code}
>   test("SPARK-22267 Spark SQL incorrectly reads ORC files when column order 
> is different") {
> withTempDir { dir =>
>   val path = dir.getCanonicalPath
>   Seq(1 -> 2).toDF("c1", 
> "c2").write.format("orc").mode("overwrite").save(path)
>   checkAnswer(spark.read.orc(path), Row(1, 2))
>   Seq("true", "false").foreach { value =>
> withTable("t") {
>   withSQLConf(HiveUtils.CONVERT_METASTORE_ORC.key -> value) {
> sql(s"CREATE EXTERNAL TABLE t(c2 INT, c1 INT) STORED AS ORC 
> LOCATION '$path'")
> checkAnswer(spark.table("t"), Row(2, 1))
>   }
> }
>   }
> }
>   }
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-22564) csv reader no longer logs errors

2017-11-22 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22564?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-22564.
---
Resolution: Won't Fix

> csv reader no longer logs errors
> 
>
> Key: SPARK-22564
> URL: https://issues.apache.org/jira/browse/SPARK-22564
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Adrian Bridgett
>Priority: Minor
>
> Since upgrading from 2.0.2 to 2.2.0 we no longer see any malformed CSV 
> warnings in the executor logs.  It looks like this may be related to 
> https://issues.apache.org/jira/browse/SPARK-19949 (where 
> maxMalformedLogPerPartition was removed) as it seems to be used but AFAICT, 
> not set anywhere.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22516) CSV Read breaks: When "multiLine" = "true", if "comment" option is set as last line's first character

2017-11-22 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22516?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16262738#comment-16262738
 ] 

Hyukjin Kwon commented on SPARK-22516:
--

Sure. Please go ahead. You could probably refer to the changes here - 
https://github.com/apache/spark/pull/19113/files. I opened a PR before, bumping the 
version of the Univocity library in order to resolve an issue fixed in a 
higher version of it. We probably need a test case as well.

> CSV Read breaks: When "multiLine" = "true", if "comment" option is set as 
> last line's first character
> -
>
> Key: SPARK-22516
> URL: https://issues.apache.org/jira/browse/SPARK-22516
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Kumaresh C R
>Priority: Minor
>  Labels: csvparser
> Attachments: testCommentChar.csv, test_file_without_eof_char.csv
>
>
> Try to read the attached CSV file with the following parse properties:
> scala> *val csvFile = 
> spark.read.option("header","true").option("inferSchema", 
> "true").option("parserLib", "univocity").option("comment", 
> "c").csv("hdfs://localhost:8020/testCommentChar.csv");*
> csvFile: org.apache.spark.sql.DataFrame = [a: string, b: string]
> scala> csvFile.show
> +---+---+
> |  a|  b|
> +---+---+
> +---+---+
> {color:#8eb021}*Noticed that it works fine.*{color}
> If we add the option "multiLine" = "true", it fails with the exception below. This 
> happens only if the "comment" character equals the first character of the input 
> dataset's last line.
> scala> val csvFile = 
> *spark.read.option("header","true").{color:red}{color:#d04437}option("multiLine","true"){color}{color}.option("inferSchema",
>  "true").option("parserLib", "univocity").option("comment", 
> "c").csv("hdfs://localhost:8020/testCommentChar.csv");*
> 17/11/14 14:26:17 ERROR Executor: Exception in task 0.0 in stage 8.0 (TID 8)
> com.univocity.parsers.common.TextParsingException: 
> java.lang.IllegalArgumentException - Unable to skip 1 lines from line 2. End 
> of input reached
> Parser Configuration: CsvParserSettings:
> Auto configuration enabled=true
> Autodetect column delimiter=false
> Autodetect quotes=false
> Column reordering enabled=true
> Empty value=null
> Escape unquoted values=false
> Header extraction enabled=null
> Headers=null
> Ignore leading whitespaces=false
> Ignore trailing whitespaces=false
> Input buffer size=128
> Input reading on separate thread=false
> Keep escape sequences=false
> Keep quotes=false
> Length of content displayed on error=-1
> Line separator detection enabled=false
> Maximum number of characters per column=-1
> Maximum number of columns=20480
> Normalize escaped line separators=true
> Null value=
> Number of records to read=all
> Processor=none
> Restricting data in exceptions=false
> RowProcessor error handler=null
> Selected fields=none
> Skip empty lines=true
> Unescaped quote handling=STOP_AT_DELIMITERFormat configuration:
> CsvFormat:
> Comment character=c
> Field delimiter=,
> Line separator (normalized)=\n
> Line separator sequence=\r\n
> Quote character="
> Quote escape character=\
> Quote escape escape character=null
> Internal state when error was thrown: line=3, column=0, record=1, charIndex=19
> at 
> 

[jira] [Resolved] (SPARK-22580) Count after filtering uncached CSV for isnull(columnNameOfCorruptRecord) always 0

2017-11-22 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-22580.
--
Resolution: Duplicate

> Count after filtering uncached CSV for isnull(columnNameOfCorruptRecord) 
> always 0
> -
>
> Key: SPARK-22580
> URL: https://issues.apache.org/jira/browse/SPARK-22580
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 2.2.0
> Environment: Same behavior on Debian and MS Windows (8.1) system. JRE 
> 1.8
>Reporter: Florian Kaspar
>
> It seems that filtering on the parser-created columnNameOfCorruptRecord and doing 
> a count afterwards does not recognize any invalid row that was put into this 
> special column.
> Filtering for members of the actualSchema works fine and yields correct 
> counts.
> Input CSV example:
> {noformat}
> val1, cat1, 1.337
> val2, cat1, 1.337
> val3, cat2, 42.0
> some, invalid, line
> {noformat}
> Code snippet:
> {code:java}
> StructType schema = new StructType(new StructField[] { 
> new StructField("s1", DataTypes.StringType, true, 
> Metadata.empty()),
> new StructField("s2", DataTypes.StringType, true, 
> Metadata.empty()),
> new StructField("d1", DataTypes.DoubleType, true, 
> Metadata.empty()),
> new StructField("FALLBACK", DataTypes.StringType, true, 
> Metadata.empty())});
> Dataset<Row> csv = sqlContext.read()
> .option("header", "false")
> .option("parserLib", "univocity")
> .option("mode", "PERMISSIVE")
> .option("maxCharsPerColumn", 1000)
> .option("ignoreLeadingWhiteSpace", "false")
> .option("ignoreTrailingWhiteSpace", "false")
> .option("comment", null)
> .option("header", "false")
> .option("columnNameOfCorruptRecord", "FALLBACK")
> .schema(schema)
> .csv(path/to/csv/file);
>  long validCount = csv.filter("FALLBACK IS NULL").count();
>  long invalidCount = csv.filter("FALLBACK IS NOT NULL").count();
> {code}
> Expected: 
> validCount is 3
> Invalid Count is 1
> Actual:
> validCount is 4
> Invalid Count is 0
> Caching the csv after load solves the problem and shows the correct counts.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22580) Count after filtering uncached CSV for isnull(columnNameOfCorruptRecord) always 0

2017-11-22 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16262725#comment-16262725
 ] 

Hyukjin Kwon commented on SPARK-22580:
--

There was a limitation and a discussion about it. I think it is fixed in 
https://github.com/apache/spark/pull/19199, which documents the workaround. Please 
refer to the discussion in https://github.com/apache/spark/pull/18865. Let me 
leave this as a duplicate of SPARK-21610.

> Count after filtering uncached CSV for isnull(columnNameOfCorruptRecord) 
> always 0
> -
>
> Key: SPARK-22580
> URL: https://issues.apache.org/jira/browse/SPARK-22580
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 2.2.0
> Environment: Same behavior on Debian and MS Windows (8.1) system. JRE 
> 1.8
>Reporter: Florian Kaspar
>
> It seems that filtering on the parser-created columnNameOfCorruptRecord and doing 
> a count afterwards does not recognize any invalid row that was put into this 
> special column.
> Filtering for members of the actualSchema works fine and yields correct 
> counts.
> Input CSV example:
> {noformat}
> val1, cat1, 1.337
> val2, cat1, 1.337
> val3, cat2, 42.0
> some, invalid, line
> {noformat}
> Code snippet:
> {code:java}
> StructType schema = new StructType(new StructField[] { 
> new StructField("s1", DataTypes.StringType, true, 
> Metadata.empty()),
> new StructField("s2", DataTypes.StringType, true, 
> Metadata.empty()),
> new StructField("d1", DataTypes.DoubleType, true, 
> Metadata.empty()),
> new StructField("FALLBACK", DataTypes.StringType, true, 
> Metadata.empty())});
> Dataset<Row> csv = sqlContext.read()
> .option("header", "false")
> .option("parserLib", "univocity")
> .option("mode", "PERMISSIVE")
> .option("maxCharsPerColumn", 1000)
> .option("ignoreLeadingWhiteSpace", "false")
> .option("ignoreTrailingWhiteSpace", "false")
> .option("comment", null)
> .option("header", "false")
> .option("columnNameOfCorruptRecord", "FALLBACK")
> .schema(schema)
> .csv(path/to/csv/file);
>  long validCount = csv.filter("FALLBACK IS NULL").count();
>  long invalidCount = csv.filter("FALLBACK IS NOT NULL").count();
> {code}
> Expected: 
> validCount is 3
> Invalid Count is 1
> Actual:
> validCount is 4
> Invalid Count is 0
> Caching the csv after load solves the problem and shows the correct counts.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22575) Making Spark Thrift Server clean up its cache

2017-11-22 Thread Oz Ben-Ami (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16262724#comment-16262724
 ] 

Oz Ben-Ami commented on SPARK-22575:


Thanks [~mgaido], we are not caching anything with {{CACHE TABLE}}. I have also 
tried running {{CLEAR CACHE}}, but it has no effect on disk usage.

> Making Spark Thrift Server clean up its cache
> -
>
> Key: SPARK-22575
> URL: https://issues.apache.org/jira/browse/SPARK-22575
> Project: Spark
>  Issue Type: Improvement
>  Components: Block Manager, SQL
>Affects Versions: 2.2.0
>Reporter: Oz Ben-Ami
>Priority: Minor
>  Labels: cache, dataproc, thrift, yarn
>
> Currently, Spark Thrift Server accumulates data in its appcache, even for old 
> queries. This fills up the disk (using over 100GB per worker node) within 
> days, and the only way to clear it is to restart the Thrift Server 
> application. Even deleting the files directly isn't a solution, as Spark then 
> complains about FileNotFound.
> I asked about this on [Stack 
> Overflow|https://stackoverflow.com/questions/46893123/how-can-i-make-spark-thrift-server-clean-up-its-cache]
>  a few weeks ago, but it does not seem to be currently doable by 
> configuration.
> Am I missing some configuration option, or some other factor here?
> Otherwise, can anyone point me to the code that handles this, so maybe I can 
> try my hand at a fix?
> Thanks!



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22579) BlockManager.getRemoteValues and BlockManager.getRemoteBytes should be implemented using streaming

2017-11-22 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-22579:
--
Issue Type: Improvement  (was: Bug)

You have to read the data either way. I don't know the code, but you'd have to 
be sure it never needs to read the data a second time after it's loaded.

> BlockManager.getRemoteValues and BlockManager.getRemoteBytes should be 
> implemented using streaming
> --
>
> Key: SPARK-22579
> URL: https://issues.apache.org/jira/browse/SPARK-22579
> Project: Spark
>  Issue Type: Improvement
>  Components: Block Manager, Spark Core
>Affects Versions: 2.1.0
>Reporter: Eyal Farago
>
> When an RDD partition is cached on an executor but the task requiring it is 
> running on another executor (process locality ANY), the cached partition is 
> fetched via BlockManager.getRemoteValues, which delegates to 
> BlockManager.getRemoteBytes; both calls are blocking.
> In my use case I had a 700GB RDD spread over 1000 partitions on a 6-node 
> cluster, cached to disk. Rough math shows that the average partition size is 
> 700MB.
> Looking at the Spark UI it was obvious that tasks running with process locality 
> 'ANY' are much slower than local tasks (~40 seconds vs. 8-10 minutes). I 
> was able to capture thread dumps of executors executing remote tasks and got 
> this stack trace:
> {quote}Thread ID  Thread Name Thread StateThread Locks
> 1521  Executor task launch worker-1000WAITING 
> Lock(java.util.concurrent.ThreadPoolExecutor$Worker@196462978})
> sun.misc.Unsafe.park(Native Method)
> java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
> java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
> java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:997)
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304)
> scala.concurrent.impl.Promise$DefaultPromise.tryAwait(Promise.scala:202)
> scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:218)
> scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
> scala.concurrent.Await$$anonfun$result$1.apply(package.scala:190)
> scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
> scala.concurrent.Await$.result(package.scala:190)
> org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:190)
> org.apache.spark.network.BlockTransferService.fetchBlockSync(BlockTransferService.scala:104)
> org.apache.spark.storage.BlockManager.getRemoteBytes(BlockManager.scala:582)
> org.apache.spark.storage.BlockManager.getRemoteValues(BlockManager.scala:550)
> org.apache.spark.storage.BlockManager.get(BlockManager.scala:638)
> org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:690)
> org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:334)
> org.apache.spark.rdd.RDD.iterator(RDD.scala:285)
> org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:89)
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
> org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
> org.apache.spark.rdd.RDD.iterator(RDD.scala:287){quote}
> Digging into the code showed that the block manager first fetches all bytes 
> (getRemoteBytes) and then wraps them with a deserialization stream; this has 
> several drawbacks:
> 1. Blocking: the requesting executor is blocked while the remote executor is 
> serving the block.
> 2. Potentially large memory footprint on the requesting executor; in my use case 
> 700MB of raw bytes stored in a ChunkedByteBuffer.
> 3. Inefficient: the requesting side usually doesn't need all values at once, as it 
> consumes the values via an iterator.
> 4. Potentially large memory footprint on the serving executor: in case the block 
> is cached in deserialized form, the serving executor has to serialize it into 
> a ChunkedByteBuffer (BlockManager.doGetLocalBytes). This is both memory and CPU 
> intensive; the memory footprint can be reduced by using a limited buffer for 
> serialization, 'spilling' to the response stream.
> I suggest improving this either by implementing a full streaming mechanism or 
> some kind of pagination mechanism. In addition, the requesting executor should 
> be able to make progress with the data it already has, blocking only when the 
> local buffer is exhausted and the remote side hasn't delivered the next chunk of 
> the stream (or page, in the case of pagination) yet.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (SPARK-22570) Cast may create a lot of UTF8String.IntWrapper or UTF8String.longWrapper instances

2017-11-22 Thread Kazuaki Ishizaki (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kazuaki Ishizaki updated SPARK-22570:
-
Issue Type: Sub-task  (was: Bug)
Parent: SPARK-22510

> Cast may create a lot of UTF8String.IntWrapper or UTF8String.longWrapper 
> instances
> --
>
> Key: SPARK-22570
> URL: https://issues.apache.org/jira/browse/SPARK-22570
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Kazuaki Ishizaki
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-19417) spark.files.overwrite is ignored

2017-11-22 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19417?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-19417.
---
Resolution: Won't Fix

I think this behavior is on purpose, as these resources are effectively 
immutable. Letting them change might cause other odd behavior. You can find 
other ways to broadcast mutable state.
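For example, a minimal sketch (not from the original comment; the file path is made up) 
of re-broadcasting changed state instead of relying on overwriting an added file:

{code:scala}
// A minimal sketch: publish a fresh broadcast when the driver-side state changes,
// instead of expecting addFile to overwrite a previously distributed file.
def readState(): String =
  scala.io.Source.fromFile("/tmp/state.txt").mkString  // assumed path, for illustration

var state = sc.broadcast(readState())
// ... executors use state.value ...

// When the file changes on the driver, drop the old broadcast and create a new one.
state.unpersist()
state = sc.broadcast(readState())
{code}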

> spark.files.overwrite is ignored
> 
>
> Key: SPARK-19417
> URL: https://issues.apache.org/jira/browse/SPARK-19417
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Chris Kanich
>
> I have not been able to get Spark to actually overwrite a file after I have 
> changed it on the driver node, re-called addFile, and then used it on the 
> executors again. Here's a failing test.
> {code}
>   test("can overwrite files when spark.files.overwrite is true") {
> val dir = Utils.createTempDir()
> val file = new File(dir, "file")
> try {
>   Files.write("one", file, StandardCharsets.UTF_8)
>   sc = new SparkContext(new 
> SparkConf().setAppName("test").setMaster("local-cluster[1,1,1024]")
>  .set("spark.files.overwrite", "true"))
>   sc.addFile(file.getAbsolutePath)
>   def getAddedFileContents(): String = {
> sc.parallelize(Seq(0)).map { _ =>
>   scala.io.Source.fromFile(SparkFiles.get("file")).mkString
> }.first()
>   }
>   assert(getAddedFileContents() === "one")
>   Files.write("two", file, StandardCharsets.UTF_8)
>   sc.addFile(file.getAbsolutePath)
>   assert(getAddedFileContents() === "onetwo")
> } finally {
>   Utils.deleteRecursively(dir)
>   sc.stop()
> }
>   }
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22560) Must create spark session directly to connect to hive

2017-11-22 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22560?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-22560:
--
Target Version/s:   (was: 2.2.0)
   Fix Version/s: (was: 2.2.0)

> Must create spark session directly to connect to hive
> -
>
> Key: SPARK-22560
> URL: https://issues.apache.org/jira/browse/SPARK-22560
> Project: Spark
>  Issue Type: Bug
>  Components: Java API, SQL
>Affects Versions: 2.1.0, 2.2.0
>Reporter: Ran Mingxuan
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> In a Java project I have to use both JavaSparkContext and SparkSession. I 
> find that the order in which they are created affects the hive connection.
> I have built a spark job like below:
> {code:java}
> // wrong code
> public void main(String[] args)
> {
> SparkConf sparkConf = new SparkConf().setAppName("testApp");
> JavaSparkContext sc = new JavaSparkContext(sparkConf);
> SparkSession spark = 
> SparkSession.builder().sparkContext(sc.sc()).enableHiveSupport().getOrCreate();
> spark.sql("show databases").show();
> }
> {code}
> and with this code the spark job will not be able to find the hive metastore even 
> though it can discover the correct warehouse.
> I have to use code like below to make things work:
> {code:java}
> // correct code 
> public String main(String[] args)
> {
> SparkConf sparkConf = new SparkConf().setAppName("testApp");
> SparkSession spark = 
> SparkSession.builder().config(sparkConf).enableHiveSupport().getOrCreate();
> SparkContext sparkContext = spark.sparkContext();
> JavaSparkContext sc = JavaSparkContext.fromSparkContext(sparkContext);
> spark.sql("show databases").show();
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22516) CSV Read breaks: When "multiLine" = "true", if "comment" option is set as last line's first character

2017-11-22 Thread Sandor Murakozi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22516?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16262516#comment-16262516
 ] 

Sandor Murakozi commented on SPARK-22516:
-

I'm a newbie; I would be happy to work on it. Would it be ok with you, 
[~hyukjin.kwon]?

> CSV Read breaks: When "multiLine" = "true", if "comment" option is set as 
> last line's first character
> -
>
> Key: SPARK-22516
> URL: https://issues.apache.org/jira/browse/SPARK-22516
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Kumaresh C R
>Priority: Minor
>  Labels: csvparser
> Attachments: testCommentChar.csv, test_file_without_eof_char.csv
>
>
> Try to read the attached CSV file with the following parse properties:
> scala> *val csvFile = 
> spark.read.option("header","true").option("inferSchema", 
> "true").option("parserLib", "univocity").option("comment", 
> "c").csv("hdfs://localhost:8020/testCommentChar.csv");*
> csvFile: org.apache.spark.sql.DataFrame = [a: string, b: string]
> scala> csvFile.show
> +---+---+
> |  a|  b|
> +---+---+
> +---+---+
> {color:#8eb021}*Noticed that it works fine.*{color}
> If we add the option "multiLine" = "true", it fails with the exception below. This 
> happens only if the "comment" character equals the first character of the input 
> dataset's last line.
> scala> val csvFile = 
> *spark.read.option("header","true").{color:red}{color:#d04437}option("multiLine","true"){color}{color}.option("inferSchema",
>  "true").option("parserLib", "univocity").option("comment", 
> "c").csv("hdfs://localhost:8020/testCommentChar.csv");*
> 17/11/14 14:26:17 ERROR Executor: Exception in task 0.0 in stage 8.0 (TID 8)
> com.univocity.parsers.common.TextParsingException: 
> java.lang.IllegalArgumentException - Unable to skip 1 lines from line 2. End 
> of input reached
> Parser Configuration: CsvParserSettings:
> Auto configuration enabled=true
> Autodetect column delimiter=false
> Autodetect quotes=false
> Column reordering enabled=true
> Empty value=null
> Escape unquoted values=false
> Header extraction enabled=null
> Headers=null
> Ignore leading whitespaces=false
> Ignore trailing whitespaces=false
> Input buffer size=128
> Input reading on separate thread=false
> Keep escape sequences=false
> Keep quotes=false
> Length of content displayed on error=-1
> Line separator detection enabled=false
> Maximum number of characters per column=-1
> Maximum number of columns=20480
> Normalize escaped line separators=true
> Null value=
> Number of records to read=all
> Processor=none
> Restricting data in exceptions=false
> RowProcessor error handler=null
> Selected fields=none
> Skip empty lines=true
> Unescaped quote handling=STOP_AT_DELIMITERFormat configuration:
> CsvFormat:
> Comment character=c
> Field delimiter=,
> Line separator (normalized)=\n
> Line separator sequence=\r\n
> Quote character="
> Quote escape character=\
> Quote escape escape character=null
> Internal state when error was thrown: line=3, column=0, record=1, charIndex=19
> at 
> com.univocity.parsers.common.AbstractParser.handleException(AbstractParser.java:339)
> at 
> com.univocity.parsers.common.AbstractParser.parseNext(AbstractParser.java:475)
> at 
> 

[jira] [Assigned] (SPARK-22581) Catalog api does not allow to specify partitioning columns with create(external)table

2017-11-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22581?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22581:


Assignee: Apache Spark

> Catalog api does not allow to specify partitioning columns with 
> create(external)table
> -
>
> Key: SPARK-22581
> URL: https://issues.apache.org/jira/browse/SPARK-22581
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Tim Van Wassenhove
>Assignee: Apache Spark
>Priority: Minor
>
> Currently it is not possible to specify partitioning columns via the Catalog 
> API when creating an (external) table. 
> Even though spark.sql will happily infer the partition scheme and find the 
> data, hive/beeline will not find data...
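As a hedged workaround sketch (not from the original report; the table, columns, and 
location are made up): until the Catalog API accepts partition columns, the partitioning 
can be declared through SQL DDL instead.

{code:scala}
// Hedged workaround sketch: declare the partition columns via SQL DDL, which the
// Catalog API's createTable currently cannot express.
spark.sql(
  """CREATE TABLE events (id BIGINT, ts TIMESTAMP, dt STRING)
    |USING parquet
    |PARTITIONED BY (dt)
    |LOCATION '/data/events'
  """.stripMargin)
{code}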



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22581) Catalog api does not allow to specify partitioning columns with create(external)table

2017-11-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22581?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22581:


Assignee: (was: Apache Spark)

> Catalog api does not allow to specify partitioning columns with 
> create(external)table
> -
>
> Key: SPARK-22581
> URL: https://issues.apache.org/jira/browse/SPARK-22581
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Tim Van Wassenhove
>Priority: Minor
>
> Currently it is not possible to specify partitioning columns via the Catalog 
> API when creating an (external) table. 
> Even though spark.sql will happily infer the partition scheme and find the 
> data, hive/beeline will not find data...



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22581) Catalog api does not allow to specify partitioning columns with create(external)table

2017-11-22 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16262497#comment-16262497
 ] 

Apache Spark commented on SPARK-22581:
--

User 'timvw' has created a pull request for this issue:
https://github.com/apache/spark/pull/19796

> Catalog api does not allow to specify partitioning columns with 
> create(external)table
> -
>
> Key: SPARK-22581
> URL: https://issues.apache.org/jira/browse/SPARK-22581
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Tim Van Wassenhove
>Priority: Minor
>
> Currently it is not possible to specify partitioning columns via the Catalog 
> API when creating an (external) table. 
> Even though spark.sql will happily infer the partition scheme and find the 
> data, hive/beeline will not find data...



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22526) Spark hangs while reading binary files from S3

2017-11-22 Thread mohamed imran (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16262489#comment-16262489
 ] 

mohamed imran commented on SPARK-22526:
---

[~ste...@apache.org] Thanks for your suggestions.

I did set the fs.s3a.connection.maximum value up to 1 and checked the 
connections using netstat -a | grep CLOSE_WAIT while processing the files 
from S3. On every read of each file the connection pool value gets increased, but it 
never closes the connections.

Because of this, at some point it hangs indefinitely.

Anyhow, as per your suggestion, I will upgrade my hadoop-2.7.3 version to 
hadoop-2.8 and take the stats. I'll keep you posted.
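For reference, a small sketch (assuming an active SparkContext {{sc}}; the value is 
illustrative) of setting the S3A pool size mentioned above. On its own this does not 
help if the streams returned by sc.binaryFiles are never closed.

{code:scala}
// Raise the S3A connection pool size (illustrative value). This alone does not fix
// connections left in CLOSE_WAIT when the PortableDataStream streams are never closed.
sc.hadoopConfiguration.set("fs.s3a.connection.maximum", "1000")
{code}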

> Spark hangs while reading binary files from S3
> --
>
> Key: SPARK-22526
> URL: https://issues.apache.org/jira/browse/SPARK-22526
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: mohamed imran
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> Hi,
> I am using Spark 2.2.0 (a recent version) to read binary files from S3. I use 
> sc.binaryFiles to read the files.
> It works fine for roughly the first 100 file reads, but later it hangs 
> indefinitely for 5 to 40 minutes, like the Avro file read issue (which was 
> fixed in later releases).
> I tried setting fs.s3a.connection.maximum to some larger values, but that 
> didn't help.
> Finally I ended up enabling the Spark speculation setting, which again 
> didn't help much.
> One thing which I observed is that it is not closing the connection after 
> every read of binary files from S3.
> Example: sc.binaryFiles("s3a://test/test123.zip")
> Please look into this major issue!



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19878) Add hive configuration when initialize hive serde in InsertIntoHiveTable.scala

2017-11-22 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16262485#comment-16262485
 ] 

Apache Spark commented on SPARK-19878:
--

User 'vinodkc' has created a pull request for this issue:
https://github.com/apache/spark/pull/19795

> Add hive configuration when initialize hive serde in InsertIntoHiveTable.scala
> --
>
> Key: SPARK-19878
> URL: https://issues.apache.org/jira/browse/SPARK-19878
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.5.0, 1.6.0, 2.0.0
> Environment: Centos 6.5: Hadoop 2.6.0, Spark 1.5.0, Hive 1.1.0
>Reporter: kavn qin
>  Labels: patch
> Attachments: SPARK-19878.patch
>
>
> When the case class InsertIntoHiveTable initializes a serde, it explicitly passes 
> null for the Configuration in Spark 1.5.0:
> [https://github.com/apache/spark/blob/v1.5.0/sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/InsertIntoHiveTable.scala#L58]
> Likewise, in Spark 2.0.0, when the HiveWriterContainer initializes a serde, it also just 
> passes null for the Configuration:
> [https://github.com/apache/spark/blob/v2.0.0/sql/hive/src/main/scala/org/apache/spark/sql/hive/hiveWriterContainers.scala#L161]
> When we implement a hive serde, we want to use the hive configuration to get 
> some static and dynamic settings, but we cannot do it!
> So this patch adds the configuration when initializing the hive serde.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19580) Support for avro.schema.url while writing to hive table

2017-11-22 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16262484#comment-16262484
 ] 

Apache Spark commented on SPARK-19580:
--

User 'vinodkc' has created a pull request for this issue:
https://github.com/apache/spark/pull/19795

> Support for avro.schema.url while writing to hive table
> ---
>
> Key: SPARK-19580
> URL: https://issues.apache.org/jira/browse/SPARK-19580
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.3, 2.1.0
>Reporter: Mateusz Boryn
>Priority: Critical
>
> Support for writing to a Hive table which uses an Avro schema pointed to by 
> avro.schema.url is missing. 
> I have a Hive table with the Avro data format. The table is created with a query 
> like this:
> {code:sql}
> CREATE TABLE some_table
>   PARTITIONED BY (YEAR int, MONTH int, DAY int)
>   ROW FORMAT SERDE
> 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
>   STORED AS INPUTFORMAT
> 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
>   OUTPUTFORMAT
> 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
>   LOCATION 'hdfs:///user/some_user/some_table'
>   TBLPROPERTIES (
> 'avro.schema.url'='hdfs:///user/some_user/some_table.avsc'
>   )
> {code}
> Please notice that the property is `avro.schema.url` and not `avro.schema.literal`, 
> as we have to keep schemas in separate files for various reasons.
> Trying to write to such a table results in an NPE.
> I tried to find a workaround for this, but nothing helped. Tried:
> - setting df.write.option("avroSchema", avroSchema) with explicit schema 
> in string
> - changing TBLPROPERTIES to SERDEPROPERTIES
> - replacing explicit detailed SERDE specification with STORED AS AVRO
> I found that this can be solved by adding a couple of lines in 
> `org.apache.spark.sql.hive.HiveShim` next to where 
> `AvroSerdeUtils.AvroTableProperties.SCHEMA_LITERAL` is referenced.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17920) HiveWriterContainer passes null configuration to serde.initialize, causing NullPointerException in AvroSerde when using avro.schema.url

2017-11-22 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17920?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16262483#comment-16262483
 ] 

Apache Spark commented on SPARK-17920:
--

User 'vinodkc' has created a pull request for this issue:
https://github.com/apache/spark/pull/19795

> HiveWriterContainer passes null configuration to serde.initialize, causing 
> NullPointerException in AvroSerde when using avro.schema.url
> ---
>
> Key: SPARK-17920
> URL: https://issues.apache.org/jira/browse/SPARK-17920
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.2, 2.0.0
> Environment: AWS EMR 5.0.0: Spark 2.0.0, Hive 2.1.0
>Reporter: James Norvell
>Priority: Minor
> Attachments: avro.avsc, avro_data
>
>
> When HiveWriterContainer initializes a serde, it explicitly passes null for the 
> Configuration:
> https://github.com/apache/spark/blob/v2.0.0/sql/hive/src/main/scala/org/apache/spark/sql/hive/hiveWriterContainers.scala#L161
> When attempting to write to a table stored as Avro with avro.schema.url set, 
> this causes a NullPointerException when it tries to get the FileSystem for 
> the URL:
> https://github.com/apache/hive/blob/release-2.1.0-rc3/serde/src/java/org/apache/hadoop/hive/serde2/avro/AvroSerdeUtils.java#L153
> Reproduction:
> {noformat}
> spark-sql> create external table avro_in (a string) stored as avro location 
> '/avro-in/' tblproperties ('avro.schema.url'='/avro-schema/avro.avsc');
> spark-sql> create external table avro_out (a string) stored as avro location 
> '/avro-out/' tblproperties ('avro.schema.url'='/avro-schema/avro.avsc');
> spark-sql> select * from avro_in;
> hello
> Time taken: 1.986 seconds, Fetched 1 row(s)
> spark-sql> insert overwrite table avro_out select * from avro_in;
> 16/10/13 19:34:47 WARN AvroSerDe: Encountered exception determining schema. 
> Returning signal schema to indicate problem
> java.lang.NullPointerException
>   at org.apache.hadoop.fs.FileSystem.getDefaultUri(FileSystem.java:182)
>   at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:174)
>   at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:359)
>   at 
> org.apache.hadoop.hive.serde2.avro.AvroSerdeUtils.getSchemaFromFS(AvroSerdeUtils.java:131)
>   at 
> org.apache.hadoop.hive.serde2.avro.AvroSerdeUtils.determineSchemaOrThrowException(AvroSerdeUtils.java:112)
>   at 
> org.apache.hadoop.hive.serde2.avro.AvroSerDe.determineSchemaOrReturnErrorSchema(AvroSerDe.java:167)
>   at 
> org.apache.hadoop.hive.serde2.avro.AvroSerDe.initialize(AvroSerDe.java:103)
>   at 
> org.apache.spark.sql.hive.SparkHiveWriterContainer.newSerializer(hiveWriterContainers.scala:161)
>   at 
> org.apache.spark.sql.hive.execution.InsertIntoHiveTable.sideEffectResult$lzycompute(InsertIntoHiveTable.scala:236)
>   at 
> org.apache.spark.sql.hive.execution.InsertIntoHiveTable.sideEffectResult(InsertIntoHiveTable.scala:142)
>   at 
> org.apache.spark.sql.hive.execution.InsertIntoHiveTable.doExecute(InsertIntoHiveTable.scala:313)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133)
>   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:114)
>   at 
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:86)
>   at 
> org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:86)
>   at org.apache.spark.sql.Dataset.(Dataset.scala:186)
>   at org.apache.spark.sql.Dataset.(Dataset.scala:167)
>   at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:65)
>   at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:582)
>   at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:682)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLDriver.run(SparkSQLDriver.scala:62)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.processCmd(SparkSQLCLIDriver.scala:331)
>   at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:376)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:247)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> 

[jira] [Assigned] (SPARK-22572) spark-shell does not re-initialize on :replay

2017-11-22 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22572?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-22572:


Assignee: Mark Petruska

> spark-shell does not re-initialize on :replay
> -
>
> Key: SPARK-22572
> URL: https://issues.apache.org/jira/browse/SPARK-22572
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
>Affects Versions: 2.3.0
>Reporter: Mark Petruska
>Assignee: Mark Petruska
>Priority: Minor
> Fix For: 2.3.0
>
>
> Spark-shell does not run the re-initialization script when a `:replay` 
> command is issued:
> {code}
> $ ./bin/spark-shell 
> 17/11/21 12:01:00 WARN NativeCodeLoader: Unable to load native-hadoop library 
> for your platform... using builtin-java classes where applicable
> Using Spark's default log4j profile: 
> org/apache/spark/log4j-defaults.properties
> Setting default log level to "WARN".
> To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use 
> setLogLevel(newLevel).
> Spark context Web UI available at http://192.168.1.3:4040
> Spark context available as 'sc' (master = local[*], app id = 
> local-1511262066013).
> Spark session available as 'spark'.
> Welcome to
>     __
>  / __/__  ___ _/ /__
> _\ \/ _ \/ _ `/ __/  '_/
>/___/ .__/\_,_/_/ /_/\_\   version 2.3.0-SNAPSHOT
>   /_/
>  
> Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_74)
> Type in expressions to have them evaluated.
> Type :help for more information.
> scala> sc
> res0: org.apache.spark.SparkContext = org.apache.spark.SparkContext@77bb916f
> scala> :replay
> Replaying: sc
> :12: error: not found: value sc
>sc
>^
> scala> sc
> :12: error: not found: value sc
>sc
>^
> scala>
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-22572) spark-shell does not re-initialize on :replay

2017-11-22 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22572?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-22572.
--
   Resolution: Fixed
Fix Version/s: 2.3.0

Issue resolved by pull request 19791
[https://github.com/apache/spark/pull/19791]

> spark-shell does not re-initialize on :replay
> -
>
> Key: SPARK-22572
> URL: https://issues.apache.org/jira/browse/SPARK-22572
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
>Affects Versions: 2.3.0
>Reporter: Mark Petruska
>Priority: Minor
> Fix For: 2.3.0
>
>
> Spark-shell does not run the re-initialization script when a `:replay` 
> command is issued:
> {code}
> $ ./bin/spark-shell 
> 17/11/21 12:01:00 WARN NativeCodeLoader: Unable to load native-hadoop library 
> for your platform... using builtin-java classes where applicable
> Using Spark's default log4j profile: 
> org/apache/spark/log4j-defaults.properties
> Setting default log level to "WARN".
> To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use 
> setLogLevel(newLevel).
> Spark context Web UI available at http://192.168.1.3:4040
> Spark context available as 'sc' (master = local[*], app id = 
> local-1511262066013).
> Spark session available as 'spark'.
> Welcome to
>     __
>  / __/__  ___ _/ /__
> _\ \/ _ \/ _ `/ __/  '_/
>/___/ .__/\_,_/_/ /_/\_\   version 2.3.0-SNAPSHOT
>   /_/
>  
> Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_74)
> Type in expressions to have them evaluated.
> Type :help for more information.
> scala> sc
> res0: org.apache.spark.SparkContext = org.apache.spark.SparkContext@77bb916f
> scala> :replay
> Replaying: sc
> :12: error: not found: value sc
>sc
>^
> scala> sc
> :12: error: not found: value sc
>sc
>^
> scala>
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-22581) Catalog api does not allow to specify partitioning columns with create(external)table

2017-11-22 Thread Tim Van Wassenhove (JIRA)
Tim Van Wassenhove created SPARK-22581:
--

 Summary: Catalog api does not allow to specify partitioning 
columns with create(external)table
 Key: SPARK-22581
 URL: https://issues.apache.org/jira/browse/SPARK-22581
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.2.0
Reporter: Tim Van Wassenhove
Priority: Minor


Currently it is not possible to specify partitioning columns via the Catalog 
API when creating an (external) table. 

Even though spark.sql will happily infer the partition scheme and find the 
data, hive/beeline will not find the data.
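
For illustration, a short sketch of the limitation as of 2.2.0 (table name, path and 
schema here are hypothetical): the Catalog.createTable variants accept a schema and 
data-source options, but nothing that declares partition columns.

{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{IntegerType, LongType, StructType}

val spark = SparkSession.builder().enableHiveSupport().getOrCreate()

val schema = new StructType()
  .add("id", LongType)
  .add("year", IntegerType)

// There is no parameter for partition columns, so the partition scheme of the files
// under `path` can only be inferred by Spark and is not recorded in the metastore,
// which is why hive/beeline does not see the data.
spark.catalog.createTable(
  "events_external",
  "parquet",
  schema,
  Map("path" -> "hdfs:///data/events"))
{code}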



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-22580) Count after filtering uncached CSV for isnull(columnNameOfCorruptRecord) always 0

2017-11-22 Thread Florian Kaspar (JIRA)
Florian Kaspar created SPARK-22580:
--

 Summary: Count after filtering uncached CSV for 
isnull(columnNameOfCorruptRecord) always 0
 Key: SPARK-22580
 URL: https://issues.apache.org/jira/browse/SPARK-22580
 Project: Spark
  Issue Type: Bug
  Components: Input/Output
Affects Versions: 2.2.0
 Environment: Same behavior on Debian and MS Windows (8.1) system. JRE 
1.8
Reporter: Florian Kaspar


It seems that filtering on the parser-created columnNameOfCorruptRecord column and 
counting afterwards does not recognize any invalid row that was put into this 
special column.

Filtering on members of the actual schema works fine and yields correct counts.

Input CSV example:
{noformat}
val1, cat1, 1.337
val2, cat1, 1.337
val3, cat2, 42.0
some, invalid, line
{noformat}

Code snippet:
{code:java}
StructType schema = new StructType(new StructField[] {
    new StructField("s1", DataTypes.StringType, true, Metadata.empty()),
    new StructField("s2", DataTypes.StringType, true, Metadata.empty()),
    new StructField("d1", DataTypes.DoubleType, true, Metadata.empty()),
    new StructField("FALLBACK", DataTypes.StringType, true, Metadata.empty())});

Dataset<Row> csv = sqlContext.read()
    .option("header", "false")
    .option("parserLib", "univocity")
    .option("mode", "PERMISSIVE")
    .option("maxCharsPerColumn", 1000)
    .option("ignoreLeadingWhiteSpace", "false")
    .option("ignoreTrailingWhiteSpace", "false")
    .option("comment", null)
    .option("header", "false")
    .option("columnNameOfCorruptRecord", "FALLBACK")
    .schema(schema)
    .csv(path/to/csv/file);

long validCount = csv.filter("FALLBACK IS NULL").count();
long invalidCount = csv.filter("FALLBACK IS NOT NULL").count();
{code}

Expected:
validCount is 3
invalidCount is 1

Actual:
validCount is 4
invalidCount is 0

Caching the csv after load solves the problem and shows the correct counts.
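
The caching workaround, as a short Scala sketch over the same schema (the path is a 
placeholder): adding cache() right after the read makes the corrupt-record filters 
count as expected.

{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{DoubleType, StringType, StructType}

val spark = SparkSession.builder().getOrCreate()

val schema = new StructType()
  .add("s1", StringType)
  .add("s2", StringType)
  .add("d1", DoubleType)
  .add("FALLBACK", StringType)

val csv = spark.read
  .option("header", "false")
  .option("mode", "PERMISSIVE")
  .option("columnNameOfCorruptRecord", "FALLBACK")
  .schema(schema)
  .csv("/path/to/csv/file")
  .cache() // without this, the counts below reportedly come back as 4 and 0

val validCount = csv.filter("FALLBACK IS NULL").count()       // expected: 3
val invalidCount = csv.filter("FALLBACK IS NOT NULL").count() // expected: 1
{code}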



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22575) Making Spark Thrift Server clean up its cache

2017-11-22 Thread Marco Gaido (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16262395#comment-16262395
 ] 

Marco Gaido commented on SPARK-22575:
-

You can use `UNCACHE TABLE` to remove them from the cache if you cached them with 
`CACHE TABLE`.
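
For reference, a minimal example of that suggestion from a SparkSession (the table 
name is hypothetical); spark.catalog.clearCache() drops everything cached through the 
SQL cache in one call.

{code:scala}
// Cache a table, query it, then release the cached blocks explicitly.
spark.sql("CACHE TABLE sales_2017")
spark.sql("SELECT count(*) FROM sales_2017").show()
spark.sql("UNCACHE TABLE sales_2017")

// Or drop every cached table/query result at once.
spark.catalog.clearCache()
{code}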

> Making Spark Thrift Server clean up its cache
> -
>
> Key: SPARK-22575
> URL: https://issues.apache.org/jira/browse/SPARK-22575
> Project: Spark
>  Issue Type: Improvement
>  Components: Block Manager, SQL
>Affects Versions: 2.2.0
>Reporter: Oz Ben-Ami
>Priority: Minor
>  Labels: cache, dataproc, thrift, yarn
>
> Currently, Spark Thrift Server accumulates data in its appcache, even for old 
> queries. This fills up the disk (using over 100GB per worker node) within 
> days, and the only way to clear it is to restart the Thrift Server 
> application. Even deleting the files directly isn't a solution, as Spark then 
> complains about FileNotFound.
> I asked about this on [Stack 
> Overflow|https://stackoverflow.com/questions/46893123/how-can-i-make-spark-thrift-server-clean-up-its-cache]
>  a few weeks ago, but it does not seem to be currently doable by 
> configuration.
> Am I missing some configuration option, or some other factor here?
> Otherwise, can anyone point me to the code that handles this, so maybe I can 
> try my hand at a fix?
> Thanks!



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22516) CSV Read breaks: When "multiLine" = "true", if "comment" option is set as last line's first character

2017-11-22 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22516?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16262334#comment-16262334
 ] 

Hyukjin Kwon commented on SPARK-22516:
--

Seems fixed in 2.5.9. We could probably bump up the Univocity library.
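
For reference, the corresponding sbt coordinate for that bump (version taken from the 
comment above; the version Spark actually pins lives in its own build files):

{code:scala}
libraryDependencies += "com.univocity" % "univocity-parsers" % "2.5.9"
{code}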

> CSV Read breaks: When "multiLine" = "true", if "comment" option is set as 
> last line's first character
> -
>
> Key: SPARK-22516
> URL: https://issues.apache.org/jira/browse/SPARK-22516
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Kumaresh C R
>Priority: Minor
>  Labels: csvparser
> Attachments: testCommentChar.csv, test_file_without_eof_char.csv
>
>
> Try to read the attached CSV file with the following parse properties:
> scala> *val csvFile = 
> spark.read.option("header","true").option("inferSchema", 
> "true").option("parserLib", "univocity").option("comment", 
> "c").csv("hdfs://localhost:8020/testCommentChar.csv");*
> csvFile: org.apache.spark.sql.DataFrame = [a: string, b: string]
>
> scala> csvFile.show
> +---+---+
> |  a|  b|
> +---+---+
> +---+---+
> {color:#8eb021}*Noticed that it works fine.*{color}
> If we add the option "multiLine" = "true", it fails with the exception below. This 
> happens only when the "comment" character equals the first character of the input 
> dataset's last line.
> scala> val csvFile = 
> *spark.read.option("header","true").{color:red}{color:#d04437}option("multiLine","true"){color}{color}.option("inferSchema",
>  "true").option("parserLib", "univocity").option("comment", 
> "c").csv("hdfs://localhost:8020/testCommentChar.csv");*
> 17/11/14 14:26:17 ERROR Executor: Exception in task 0.0 in stage 8.0 (TID 8)
> com.univocity.parsers.common.TextParsingException: 
> java.lang.IllegalArgumentException - Unable to skip 1 lines from line 2. End 
> of input reached
> Parser Configuration: CsvParserSettings:
> Auto configuration enabled=true
> Autodetect column delimiter=false
> Autodetect quotes=false
> Column reordering enabled=true
> Empty value=null
> Escape unquoted values=false
> Header extraction enabled=null
> Headers=null
> Ignore leading whitespaces=false
> Ignore trailing whitespaces=false
> Input buffer size=128
> Input reading on separate thread=false
> Keep escape sequences=false
> Keep quotes=false
> Length of content displayed on error=-1
> Line separator detection enabled=false
> Maximum number of characters per column=-1
> Maximum number of columns=20480
> Normalize escaped line separators=true
> Null value=
> Number of records to read=all
> Processor=none
> Restricting data in exceptions=false
> RowProcessor error handler=null
> Selected fields=none
> Skip empty lines=true
> Unescaped quote handling=STOP_AT_DELIMITERFormat configuration:
> CsvFormat:
> Comment character=c
> Field delimiter=,
> Line separator (normalized)=\n
> Line separator sequence=\r\n
> Quote character="
> Quote escape character=\
> Quote escape escape character=null
> Internal state when error was thrown: line=3, column=0, record=1, charIndex=19
> at 
> com.univocity.parsers.common.AbstractParser.handleException(AbstractParser.java:339)
> at 
> com.univocity.parsers.common.AbstractParser.parseNext(AbstractParser.java:475)
> at 
> 

[jira] [Assigned] (SPARK-21168) KafkaRDD should always set kafka clientId.

2017-11-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21168?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21168:


Assignee: (was: Apache Spark)

> KafkaRDD should always set kafka clientId.
> --
>
> Key: SPARK-21168
> URL: https://issues.apache.org/jira/browse/SPARK-21168
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 2.0.2
>Reporter: Xingxing Di
>Priority: Trivial
>
> I found that KafkaRDD does not set the kafka client.id in the "fetchBatch" method 
> (FetchRequestBuilder sets the clientId to an empty string by default). Normally this 
> affects nothing, but in our case we use the clientId on the Kafka server side, 
> so we had to rebuild spark-streaming-kafka.
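
As a hedged illustration of the report (old Kafka 0.8 simple-consumer API; the topic, 
partition and clientId values are made up): FetchRequestBuilder does expose a clientId 
setter, which is what KafkaRDD reportedly never calls.

{code:scala}
import kafka.api.FetchRequestBuilder

// Build a fetch request the way KafkaRDD does, but with an explicit clientId so the
// broker side can attribute the traffic (the builder's default is an empty string).
val fetchRequest = new FetchRequestBuilder()
  .clientId("spark-streaming-app")
  .addFetch("events", 0, 0L, 1024 * 1024)
  .build()
{code}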



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21168) KafkaRDD should always set kafka clientId.

2017-11-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21168?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21168:


Assignee: Apache Spark

> KafkaRDD should always set kafka clientId.
> --
>
> Key: SPARK-21168
> URL: https://issues.apache.org/jira/browse/SPARK-21168
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 2.0.2
>Reporter: Xingxing Di
>Assignee: Apache Spark
>Priority: Trivial
>
> I found that KafkaRDD does not set the kafka client.id in the "fetchBatch" method 
> (FetchRequestBuilder sets the clientId to an empty string by default). Normally this 
> affects nothing, but in our case we use the clientId on the Kafka server side, 
> so we had to rebuild spark-streaming-kafka.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20101) Use OffHeapColumnVector when "spark.sql.columnVector.offheap.enable" is set to "true"

2017-11-22 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16262190#comment-16262190
 ] 

Apache Spark commented on SPARK-20101:
--

User 'kiszk' has created a pull request for this issue:
https://github.com/apache/spark/pull/19794

> Use OffHeapColumnVector when "spark.sql.columnVector.offheap.enable" is set 
> to "true"
> -
>
> Key: SPARK-20101
> URL: https://issues.apache.org/jira/browse/SPARK-20101
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Kazuaki Ishizaki
>Assignee: Kazuaki Ishizaki
> Fix For: 2.3.0
>
>
> While {{ColumnVector}} has two implementations, {{OnHeapColumnVector}} and 
> {{OffHeapColumnVector}}, only {{OnHeapColumnVector}} is used.
> This JIRA enables the use of {{OffHeapColumnVector}} when 
> {{spark.memory.offHeap.enabled}} is set to {{true}}.
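
For context, a hedged sketch of enabling this once the change lands; note the ticket 
title and description mention two different property names, so both are shown here and 
should be checked against the Spark version actually used.

{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("offheap-column-vectors")
  // Off-heap memory must be enabled and sized for off-heap vectors to be usable.
  .config("spark.memory.offHeap.enabled", "true")
  .config("spark.memory.offHeap.size", "2g")
  // Property name from the ticket title; the description refers to
  // spark.memory.offHeap.enabled instead.
  .config("spark.sql.columnVector.offheap.enable", "true")
  .getOrCreate()
{code}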



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22516) CSV Read breaks: When "multiLine" = "true", if "comment" option is set as last line's first character

2017-11-22 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22516?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16262189#comment-16262189
 ] 

Hyukjin Kwon commented on SPARK-22516:
--

This can be reproduced by:

{code}
spark.read.option("header","true").option("inferSchema", 
"true").option("multiLine", "true").option("comment", 
"g").csv("test_file_without_eof_char.csv").show()
{code}

The root cause seems to be in the Univocity parser. I filed an issue there - 
https://github.com/uniVocity/univocity-parsers/issues/213

BTW, let's keep the description and reproducer as clean as possible. I was actually 
about to say the same thing as above ^ but realised it's a separate issue after 
multiple close looks. 

> CSV Read breaks: When "multiLine" = "true", if "comment" option is set as 
> last line's first character
> -
>
> Key: SPARK-22516
> URL: https://issues.apache.org/jira/browse/SPARK-22516
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Kumaresh C R
>Priority: Minor
>  Labels: csvparser
> Attachments: testCommentChar.csv, test_file_without_eof_char.csv
>
>
> Try to read the attached CSV file with the following parse properties:
> scala> *val csvFile = 
> spark.read.option("header","true").option("inferSchema", 
> "true").option("parserLib", "univocity").option("comment", 
> "c").csv("hdfs://localhost:8020/testCommentChar.csv");*
> csvFile: org.apache.spark.sql.DataFrame = [a: string, b: string]
>
> scala> csvFile.show
> +---+---+
> |  a|  b|
> +---+---+
> +---+---+
> {color:#8eb021}*Noticed that it works fine.*{color}
> If we add the option "multiLine" = "true", it fails with the exception below. This 
> happens only when the "comment" character equals the first character of the input 
> dataset's last line.
> scala> val csvFile = 
> *spark.read.option("header","true").{color:red}{color:#d04437}option("multiLine","true"){color}{color}.option("inferSchema",
>  "true").option("parserLib", "univocity").option("comment", 
> "c").csv("hdfs://localhost:8020/testCommentChar.csv");*
> 17/11/14 14:26:17 ERROR Executor: Exception in task 0.0 in stage 8.0 (TID 8)
> com.univocity.parsers.common.TextParsingException: 
> java.lang.IllegalArgumentException - Unable to skip 1 lines from line 2. End 
> of input reached
> Parser Configuration: CsvParserSettings:
> Auto configuration enabled=true
> Autodetect column delimiter=false
> Autodetect quotes=false
> Column reordering enabled=true
> Empty value=null
> Escape unquoted values=false
> Header extraction enabled=null
> Headers=null
> Ignore leading whitespaces=false
> Ignore trailing whitespaces=false
> Input buffer size=128
> Input reading on separate thread=false
> Keep escape sequences=false
> Keep quotes=false
> Length of content displayed on error=-1
> Line separator detection enabled=false
> Maximum number of characters per column=-1
> Maximum number of columns=20480
> Normalize escaped line separators=true
> Null value=
> Number of records to read=all
> Processor=none
> Restricting data in exceptions=false
> RowProcessor error handler=null
> Selected fields=none
> Skip empty lines=true
> Unescaped quote handling=STOP_AT_DELIMITERFormat configuration:
> CsvFormat:
> Comment character=c
> Field delimiter=,
> Line separator (normalized)=\n
> Line separator sequence=\r\n
> Quote 

[jira] [Updated] (SPARK-22574) Wrong request causing Spark Dispatcher going inactive

2017-11-22 Thread German Schiavon Matteo (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22574?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

German Schiavon Matteo updated SPARK-22574:
---
Description: 
Submitting a malformed _CreateSubmissionRequest_ to the Spark Dispatcher puts the 
Dispatcher in a bad state and makes it inactive as a Mesos framework.

The class CreateSubmissionRequest initialises its arguments to null as follows:

{code:title=CreateSubmissionRequest.scala|borderStyle=solid}
  var appResource: String = null
  var mainClass: String = null
  var appArgs: Array[String] = null
  var sparkProperties: Map[String, String] = null
  var environmentVariables: Map[String, String] = null
{code}

Some of these variables are checked, but not all of them; for example, appArgs and 
environmentVariables are not.

If you don't set _appArgs_, it causes the following error:
{code:title=error|borderStyle=solid}
17/11/21 14:37:24 INFO MesosClusterScheduler: Reviving Offers.
Exception in thread "Thread-22" java.lang.NullPointerException
at 
org.apache.spark.scheduler.cluster.mesos.MesosClusterScheduler.getDriverCommandValue(MesosClusterScheduler.scala:444)
at 
org.apache.spark.scheduler.cluster.mesos.MesosClusterScheduler.buildDriverCommand(MesosClusterScheduler.scala:451)
at 
org.apache.spark.scheduler.cluster.mesos.MesosClusterScheduler.org$apache$spark$scheduler$cluster$mesos$MesosClusterScheduler$$createTaskInfo(MesosClusterScheduler.scala:538)
at 
org.apache.spark.scheduler.cluster.mesos.MesosClusterScheduler$$anonfun$scheduleTasks$1.apply(MesosClusterScheduler.scala:570)
at 
org.apache.spark.scheduler.cluster.mesos.MesosClusterScheduler$$anonfun$scheduleTasks$1.apply(MesosClusterScheduler.scala:555)
at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at 
org.apache.spark.scheduler.cluster.mesos.MesosClusterScheduler.scheduleTasks(MesosClusterScheduler.scala:555)
at 
org.apache.spark.scheduler.cluster.mesos.MesosClusterScheduler.resourceOffers(MesosClusterScheduler.scala:621)
{code}

This is because the scheduler accesses _appArgs_ without checking whether it is null.
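
A hedged sketch, not the actual Spark fix, of the kind of null check the description 
argues for, using a simplified stand-in for the request class:

{code:scala}
// Simplified stand-in for CreateSubmissionRequest with the same nullable fields.
class SubmissionRequest {
  var appResource: String = null
  var mainClass: String = null
  var appArgs: Array[String] = null
  var sparkProperties: Map[String, String] = null
  var environmentVariables: Map[String, String] = null
}

// Reject malformed submissions before they reach the Mesos scheduler, instead of
// letting a null appArgs surface later as a NullPointerException in
// getDriverCommandValue.
def validate(req: SubmissionRequest): Either[String, SubmissionRequest] =
  if (req.appArgs == null) Left("appArgs must not be null")
  else if (req.environmentVariables == null) Left("environmentVariables must not be null")
  else Right(req)
{code}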

 

  was:
When submitting a wrong _CreateSubmissionRequest_ to Spark Dispatcher is 
causing a bad state of Dispatcher and making it inactive as a mesos framework.

The class CreateSubmissionRequest initialise its arguments to null as follows:

{code:title=CreateSubmissionRequest.scala|borderStyle=solid}
  var appResource: String = null
  var mainClass: String = null
  var appArgs: Array[String] = null
  var sparkProperties: Map[String, String] = null
  var environmentVariables: Map[String, String] = null
{code}

There are some checks of this variable but not in all of them, for example in 
appArgs and environmentVariables. 

If you don't set _appArgs_ it will cause the following error: 
{code:title=error|borderStyle=solid}
17/11/21 14:37:24 INFO MesosClusterScheduler: Reviving Offers.
Exception in thread "Thread-22" java.lang.NullPointerException
at 
org.apache.spark.scheduler.cluster.mesos.MesosClusterScheduler.getDriverCommandValue(MesosClusterScheduler.scala:444)
at 
org.apache.spark.scheduler.cluster.mesos.MesosClusterScheduler.buildDriverCommand(MesosClusterScheduler.scala:451)
at 
org.apache.spark.scheduler.cluster.mesos.MesosClusterScheduler.org$apache$spark$scheduler$cluster$mesos$MesosClusterScheduler$$createTaskInfo(MesosClusterScheduler.scala:538)
at 
org.apache.spark.scheduler.cluster.mesos.MesosClusterScheduler$$anonfun$scheduleTasks$1.apply(MesosClusterScheduler.scala:570)
at 
org.apache.spark.scheduler.cluster.mesos.MesosClusterScheduler$$anonfun$scheduleTasks$1.apply(MesosClusterScheduler.scala:555)
at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at 
org.apache.spark.scheduler.cluster.mesos.MesosClusterScheduler.scheduleTasks(MesosClusterScheduler.scala:555)
at 
org.apache.spark.scheduler.cluster.mesos.MesosClusterScheduler.resourceOffers(MesosClusterScheduler.scala:621)
{code}

Because it's trying to access to it without checking whether is null or not.

 


> Wrong request causing Spark Dispatcher going inactive
> -
>
> Key: SPARK-22574
> URL: https://issues.apache.org/jira/browse/SPARK-22574
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos, Spark Submit
>Affects Versions: 2.2.0
>Reporter: German Schiavon Matteo
>Priority: Minor
> Fix For: 2.0.0, 2.1.0, 2.2.0
>
>
> When submitting a wrong _CreateSubmissionRequest_ to Spark Dispatcher is 
> causing a bad state of Dispatcher and making it 

[jira] [Resolved] (SPARK-22578) CSV with quoted line breaks not correctly parsed

2017-11-22 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22578?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-22578.
--
Resolution: Invalid

I just double checked with your input:

{code}
scala> spark.read.option("multiLine", true).csv("tmp.csv").first
res2: org.apache.spark.sql.Row =
[4/28/2015 8:01,4/28/2015 8:19,0,100,1078,1,4/28/2015 8:19,email,Hello

World, 2,3,4,5]

scala> spark.read.option("multiLine", true).csv("tmp.csv").count()
res3: Long = 1
{code}

Let me leave this resolved.

> CSV with quoted line breaks not correctly parsed
> 
>
> Key: SPARK-22578
> URL: https://issues.apache.org/jira/browse/SPARK-22578
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Carlos Barahona
>
> I believe the behavior addressed in SPARK-19610 still exists. Using spark 
> 2.2.0, when attempting to read in a CSV file containing a quoted newline, the 
> resulting dataset contains two separate items split along the quoted newline.
> Example text:
> {code:java}
> 4/28/2015 8:01,4/28/2015 8:19,0,100,1078,1,4/28/2015 8:19,email,"Hello
> World", 2,3,4,5
> {code}
> scala> val csvFile = spark.read.csv("file:///path")
> csvFile: org.apache.spark.sql.DataFrame = [_c0: string, _c1: string ... 7 
> more fields]
> scala> csvFile.first()
> res2: org.apache.spark.sql.Row = [4/28/2015 8:01,4/28/2015 
> 8:19,0,100,1078,1,4/28/2015 8:19,email,Hello]
> scala> csvFile.count()
> res3: Long = 2



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-22560) Must create spark session directly to connect to hive

2017-11-22 Thread Ran Mingxuan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22560?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ran Mingxuan resolved SPARK-22560.
--
  Resolution: Works for Me
   Fix Version/s: 2.2.0
Target Version/s: 2.2.0

I added one line of code to set the spark context configuration after getting the 
spark session configuration.

> Must create spark session directly to connect to hive
> -
>
> Key: SPARK-22560
> URL: https://issues.apache.org/jira/browse/SPARK-22560
> Project: Spark
>  Issue Type: Bug
>  Components: Java API, SQL
>Affects Versions: 2.1.0, 2.2.0
>Reporter: Ran Mingxuan
> Fix For: 2.2.0
>
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> In a java project I have to use both JavaSparkContext and SparkSession. I 
> find that the order in which they are created affects the hive connection.
> I have built a spark job like below:
> {code:java}
> // wrong code
> public void main(String[] args)
> {
> SparkConf sparkConf = new SparkConf().setAppName("testApp");
> JavaSparkContext sc = new JavaSparkContext(sparkConf);
> SparkSession spark = 
> SparkSession.builder().sparkContext(sc.sc()).enableHiveSupport().getOrCreate();
> spark.sql("show databases").show();
> }
> {code}
> and with this code the spark job is not able to find the hive metastore even if 
> it can discover the correct warehouse.
> I have to use code like below to make things work:
> {code:java}
> // correct code 
> public String main(String[] args)
> {
> SparkConf sparkConf = new SparkConf().setAppName("testApp");
> SparkSession spark = 
> SparkSession.builder().config(sparkConf).enableHiveSupport().getOrCreate();
> SparkContext sparkContext = spark.sparkContext();
> JavaSparkContext sc = JavaSparkContext.fromSparkContext(sparkContext);
> spark.sql("show databases").show();
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22579) BlockManager.getRemoteValues and BlockManager.getRemoteBytes should be implemented using streaming

2017-11-22 Thread Eyal Farago (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16262120#comment-16262120
 ] 

Eyal Farago commented on SPARK-22579:
-

CC: [~hvanhovell] (we've discussed this privately), [~joshrosen], 
[~jerryshao2015] (git annotations point at you guys :-) )

> BlockManager.getRemoteValues and BlockManager.getRemoteBytes should be 
> implemented using streaming
> --
>
> Key: SPARK-22579
> URL: https://issues.apache.org/jira/browse/SPARK-22579
> Project: Spark
>  Issue Type: Bug
>  Components: Block Manager, Spark Core
>Affects Versions: 2.1.0
>Reporter: Eyal Farago
>
> When an RDD partition is cached on one executor but the task requiring it is 
> running on another executor (process locality ANY), the cached partition is 
> fetched via BlockManager.getRemoteValues, which delegates to 
> BlockManager.getRemoteBytes; both calls are blocking.
> In my use case I had a 700GB RDD spread over 1000 partitions on a 6-node 
> cluster, cached to disk. Rough math shows that the average partition size is 
> 700MB.
> Looking at the Spark UI it was obvious that tasks running with process locality 
> 'ANY' are much slower than local tasks (roughly 40 seconds versus 8-10 minutes). I 
> was able to capture thread dumps of executors executing remote tasks and got 
> this stack trace:
> {quote}Thread ID  Thread Name Thread StateThread Locks
> 1521  Executor task launch worker-1000WAITING 
> Lock(java.util.concurrent.ThreadPoolExecutor$Worker@196462978})
> sun.misc.Unsafe.park(Native Method)
> java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
> java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
> java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:997)
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304)
> scala.concurrent.impl.Promise$DefaultPromise.tryAwait(Promise.scala:202)
> scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:218)
> scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
> scala.concurrent.Await$$anonfun$result$1.apply(package.scala:190)
> scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
> scala.concurrent.Await$.result(package.scala:190)
> org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:190)
> org.apache.spark.network.BlockTransferService.fetchBlockSync(BlockTransferService.scala:104)
> org.apache.spark.storage.BlockManager.getRemoteBytes(BlockManager.scala:582)
> org.apache.spark.storage.BlockManager.getRemoteValues(BlockManager.scala:550)
> org.apache.spark.storage.BlockManager.get(BlockManager.scala:638)
> org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:690)
> org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:334)
> org.apache.spark.rdd.RDD.iterator(RDD.scala:285)
> org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:89)
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
> org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
> org.apache.spark.rdd.RDD.iterator(RDD.scala:287){quote}
> Digging into the code showed that the block manager first fetches all bytes 
> (getRemoteBytes) and then wraps them with a deserialization stream. This has 
> several drawbacks:
> 1. Blocking: the requesting executor is blocked while the remote executor is 
> serving the block.
> 2. Potentially large memory footprint on the requesting executor; in my use case 
> 700MB of raw bytes stored in a ChunkedByteBuffer.
> 3. Inefficient: the requesting side usually doesn't need all values at once, as it 
> consumes the values via an iterator.
> 4. Potentially large memory footprint on the serving executor: if the block 
> is cached in deserialized form, the serving executor has to serialize it into 
> a ChunkedByteBuffer (BlockManager.doGetLocalBytes). This is both memory and CPU 
> intensive; the memory footprint can be reduced by using a limited buffer for 
> serialization, 'spilling' to the response stream.
> I suggest improving this either by implementing a full streaming mechanism or 
> some kind of pagination mechanism. In addition, the requesting executor should 
> be able to make progress with the data it already has, blocking only when the 
> local buffer is exhausted and the remote side hasn't delivered the next chunk of 
> the stream (or page, in the case of pagination) yet.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-

[jira] [Commented] (SPARK-22578) CSV with quoted line breaks not correctly parsed

2017-11-22 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16262119#comment-16262119
 ] 

Hyukjin Kwon commented on SPARK-22578:
--

Can you enable the {{multiLine}} option? It's disabled by default.

> CSV with quoted line breaks not correctly parsed
> 
>
> Key: SPARK-22578
> URL: https://issues.apache.org/jira/browse/SPARK-22578
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Carlos Barahona
>
> I believe the behavior addressed in SPARK-19610 still exists. Using spark 
> 2.2.0, when attempting to read in a CSV file containing a quoted newline, the 
> resulting dataset contains two separate items split along the quoted newline.
> Example text:
> {code:java}
> 4/28/2015 8:01,4/28/2015 8:19,0,100,1078,1,4/28/2015 8:19,email,"Hello
> World", 2,3,4,5
> {code}
> scala> val csvFile = spark.read.csv("file:///path")
> csvFile: org.apache.spark.sql.DataFrame = [_c0: string, _c1: string ... 7 
> more fields]
> scala> csvFile.first()
> res2: org.apache.spark.sql.Row = [4/28/2015 8:01,4/28/2015 
> 8:19,0,100,1078,1,4/28/2015 8:19,email,Hello]
> scala> csvFile.count()
> res3: Long = 2



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22560) Must create spark session directly to connect to hive

2017-11-22 Thread Ran Mingxuan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16262118#comment-16262118
 ] 

Ran Mingxuan commented on SPARK-22560:
--

In my opinion, when a spark session is built from a spark context with some 
additional options, those options should also be added to the spark context. 
In this way the conflicting options would be corrected?


> Must create spark session directly to connect to hive
> -
>
> Key: SPARK-22560
> URL: https://issues.apache.org/jira/browse/SPARK-22560
> Project: Spark
>  Issue Type: Bug
>  Components: Java API, SQL
>Affects Versions: 2.1.0, 2.2.0
>Reporter: Ran Mingxuan
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> In a java project I have to use both JavaSparkContext and SparkSession. I 
> find that the order in which they are created affects the hive connection.
> I have built a spark job like below:
> {code:java}
> // wrong code
> public void main(String[] args)
> {
> SparkConf sparkConf = new SparkConf().setAppName("testApp");
> JavaSparkContext sc = new JavaSparkContext(sparkConf);
> SparkSession spark = 
> SparkSession.builder().sparkContext(sc.sc()).enableHiveSupport().getOrCreate();
> spark.sql("show databases").show();
> }
> {code}
> and with this code the spark job is not able to find the hive metastore even if 
> it can discover the correct warehouse.
> I have to use code like below to make things work:
> {code:java}
> // correct code 
> public String main(String[] args)
> {
> SparkConf sparkConf = new SparkConf().setAppName("testApp");
> SparkSession spark = 
> SparkSession.builder().config(sparkConf).enableHiveSupport().getOrCreate();
> SparkContext sparkContext = spark.sparkContext();
> JavaSparkContext sc = JavaSparkContext.fromSparkContext(sparkContext);
> spark.sql("show databases").show();
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org