[jira] [Updated] (SPARK-21661) SparkSQL can't merge load table from Hadoop

2018-04-26 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21661?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-21661:
-
Fix Version/s: 2.3.0

> SparkSQL can't merge load table from Hadoop
> ---
>
>                 Key: SPARK-21661
>                 URL: https://issues.apache.org/jira/browse/SPARK-21661
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.2.0
>            Reporter: Dapeng Sun
>            Assignee: Li Yuanjian
>            Priority: Major
>             Fix For: 2.3.0
>
>
> Here are the original text files of the external table on HDFS:
> {noformat}
> Permission  Owner  Group       Size   Last Modified          Replication  Block Size  Name
> -rw-r--r--  root   supergroup  0 B    8/6/2017, 11:43:03 PM  3            256 MB      income_band_001.dat
> -rw-r--r--  root   supergroup  0 B    8/6/2017, 11:39:31 PM  3            256 MB      income_band_002.dat
> ...
> -rw-r--r--  root   supergroup  327 B  8/6/2017, 11:44:47 PM  3            256 MB      income_band_530.dat
> {noformat}
> After loading with SparkSQL, every file has an output file, even when the
> file is 0 B. When loading with Hive, the data files are merged according to
> the data size of the original files.
> To reproduce:
> {noformat}
> CREATE EXTERNAL TABLE t1 (a int, b string) STORED AS TEXTFILE LOCATION "hdfs://xxx:9000/data/t1";
> CREATE TABLE t2 STORED AS PARQUET AS SELECT * FROM t1;
> {noformat}
> The table t2 then contains many small files without data.
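The size-based merging that the report expects can be illustrated with a small sketch. This is not Spark's or Hive's actual implementation; the function name `plan_output_groups` and the 256 MB target are assumptions chosen for illustration (the target simply mirrors the block size in the listing). It skips empty inputs and greedily packs the remaining files into groups, so each group would correspond to one output file.

```python
from typing import List, Tuple


def plan_output_groups(files: List[Tuple[str, int]], target_bytes: int) -> List[List[str]]:
    """Greedily pack non-empty input files into groups of at most target_bytes.

    Hypothetical helper for illustration only; each returned group stands for
    one output file.
    """
    groups: List[List[str]] = []
    current: List[str] = []
    current_size = 0
    for name, size in files:
        if size == 0:
            # An empty input carries no rows, so it needs no output file.
            continue
        if current and current_size + size > target_bytes:
            # Adding this file would exceed the target: close the current group.
            groups.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        groups.append(current)
    return groups


# Three of the files from the listing above (sizes in bytes).
files = [
    ("income_band_001.dat", 0),
    ("income_band_002.dat", 0),
    ("income_band_530.dat", 327),
]
groups = plan_output_groups(files, target_bytes=256 * 1024 * 1024)
print(len(files), "input files ->", len(groups), "output file(s)")
```

With the sizes from the listing, the two 0 B files are dropped and the 327 B file ends up alone in a single group, instead of one output file per input file.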



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21661) SparkSQL can't merge load table from Hadoop

2017-08-07 Thread Dapeng Sun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21661?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dapeng Sun updated SPARK-21661:
---
Description: 
Here are the original text files of the external table on HDFS:
{noformat}
Permission  Owner  Group       Size   Last Modified          Replication  Block Size  Name
-rw-r--r--  root   supergroup  0 B    8/6/2017, 11:43:03 PM  3            256 MB      income_band_001.dat
-rw-r--r--  root   supergroup  0 B    8/6/2017, 11:39:31 PM  3            256 MB      income_band_002.dat
...
-rw-r--r--  root   supergroup  327 B  8/6/2017, 11:44:47 PM  3            256 MB      income_band_530.dat
{noformat}
After loading with SparkSQL, every file has an output file, even when the
file is 0 B. When loading with Hive, the data files are merged according to
the data size of the original files.

To reproduce:
{noformat}
CREATE EXTERNAL TABLE t1 (a int, b string) STORED AS TEXTFILE LOCATION "hdfs://xxx:9000/data/t1";
CREATE TABLE t2 STORED AS PARQUET AS SELECT * FROM t1;
{noformat}

The table t2 then contains many small files without data.

  was:
Here are the original text files of the external table on HDFS:
{noformat}
Permission  Owner  Group       Size   Last Modified          Replication  Block Size  Name
-rw-r--r--  root   supergroup  0 B    8/6/2017, 11:43:03 PM  3            256 MB      income_band_001.dat
-rw-r--r--  root   supergroup  0 B    8/6/2017, 11:39:31 PM  3            256 MB      income_band_002.dat
...
-rw-r--r--  root   supergroup  327 B  8/6/2017, 11:44:47 PM  3            256 MB      income_band_530.dat
{noformat}
After loading with SparkSQL, every file has an output file, even when the
file is 0 B. When loading with Hive, the data files are merged according to
the data size of the original files.

CREATE EXTERNAL TABLE t1 (a int,b string) 








[jira] [Updated] (SPARK-21661) SparkSQL can't merge load table from Hadoop

2017-08-07 Thread Dapeng Sun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21661?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dapeng Sun updated SPARK-21661:
---
Description: 
Here are the original text files of the external table on HDFS:
{noformat}
Permission  Owner  Group       Size   Last Modified          Replication  Block Size  Name
-rw-r--r--  root   supergroup  0 B    8/6/2017, 11:43:03 PM  3            256 MB      income_band_001.dat
-rw-r--r--  root   supergroup  0 B    8/6/2017, 11:39:31 PM  3            256 MB      income_band_002.dat
...
-rw-r--r--  root   supergroup  327 B  8/6/2017, 11:44:47 PM  3            256 MB      income_band_530.dat
{noformat}
After loading with SparkSQL, every file has an output file, even when the
file is 0 B. When loading with Hive, the data files are merged according to
the data size of the original files.

CREATE EXTERNAL TABLE t1 (a int,b string) 

  was:
Here are the original text files of the external table on HDFS:
{noformat}
Permission  Owner  Group       Size   Last Modified          Replication  Block Size  Name
-rw-r--r--  root   supergroup  0 B    8/6/2017, 11:43:03 PM  3            256 MB      income_band_001.dat
-rw-r--r--  root   supergroup  0 B    8/6/2017, 11:39:31 PM  3            256 MB      income_band_002.dat
...
-rw-r--r--  root   supergroup  327 B  8/6/2017, 11:44:47 PM  3            256 MB      income_band_530.dat
{noformat}
After loading with SparkSQL, every file has an output file, even when the
file is 0 B. When loading with Hive, the data files are merged according to
the data size of the original files.









[jira] [Updated] (SPARK-21661) SparkSQL can't merge load table from Hadoop

2017-08-07 Thread Dapeng Sun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21661?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dapeng Sun updated SPARK-21661:
---
Description: 
Here are the original text files of the external table on HDFS:
{noformat}
Permission  Owner  Group       Size   Last Modified          Replication  Block Size  Name
-rw-r--r--  root   supergroup  0 B    8/6/2017, 11:43:03 PM  3            256 MB      income_band_001.dat
-rw-r--r--  root   supergroup  0 B    8/6/2017, 11:39:31 PM  3            256 MB      income_band_002.dat
...
-rw-r--r--  root   supergroup  327 B  8/6/2017, 11:44:47 PM  3            256 MB      income_band_530.dat
{noformat}
After loading with SparkSQL, every file has an output file, even when the
file is 0 B. When loading with Hive, the data files are merged according to
the data size of the original files.


  was:
Here are the original text files of the external table on HDFS:
{noformat}
Permission  Owner  Group       Size   Last Modified          Replication  Block Size  Name
-rw-r--r--  root   supergroup  0 B    8/6/2017, 11:43:03 PM  3            256 MB      income_band_001.dat
-rw-r--r--  root   supergroup  0 B    8/6/2017, 11:39:31 PM  3            256 MB      income_band_002.dat
...
-rw-r--r--  root   supergroup  327 B  8/6/2017, 11:44:47 PM  3            256 MB      income_band_530.dat
{noformat}
After loading with SparkSQL, each file has an output, even when the file is
0 B. When loading with Hive, the data files are merged according to the data
size of the original files.





