[jira] [Commented] (SPARK-19131) Support "alter table drop partition [if exists]"

2017-01-08 Thread lichenglin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15810768#comment-15810768
 ] 

lichenglin commented on SPARK-19131:


Yes

> Support "alter table drop partition [if exists]"
> 
>
> Key: SPARK-19131
> URL: https://issues.apache.org/jira/browse/SPARK-19131
> Project: Spark
>  Issue Type: New Feature
>Affects Versions: 2.1.0
>Reporter: lichenglin
>
> {code}
> val parts = client.getPartitions(hiveTable, s.asJava).asScala
> if (parts.isEmpty && !ignoreIfNotExists) {
>   throw new AnalysisException(
> s"No partition is dropped. One partition spec '$s' does not exist 
> in table '$table' " +
> s"database '$db'")
> }
> parts.map(_.getValues)
> {code}
> As of 2.1.0, dropping a partition throws an exception when there is no partition to drop.
> I notice there is a parameter named ignoreIfNotExists,
> but I don't know how to set it.
> Maybe we could implement "alter table drop partition [if exists]".






[jira] [Created] (SPARK-19131) Support "alter table drop partition [if exists]"

2017-01-08 Thread lichenglin (JIRA)
lichenglin created SPARK-19131:
--

 Summary: Support "alter table drop partition [if exists]"
 Key: SPARK-19131
 URL: https://issues.apache.org/jira/browse/SPARK-19131
 Project: Spark
  Issue Type: New Feature
Affects Versions: 2.1.0
Reporter: lichenglin


{code}
val parts = client.getPartitions(hiveTable, s.asJava).asScala
if (parts.isEmpty && !ignoreIfNotExists) {
  throw new AnalysisException(
s"No partition is dropped. One partition spec '$s' does not exist 
in table '$table' " +
s"database '$db'")
}
parts.map(_.getValues)
{code}

As of 2.1.0, dropping a partition throws an exception when there is no partition to drop.
I notice there is a parameter named ignoreIfNotExists,
but I don't know how to set it.
Maybe we could implement "alter table drop partition [if exists]".
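For illustration, a minimal sketch of the requested syntax, mirroring Hive's DDL (assuming a SparkSession named spark; the table and partition spec are hypothetical):

{code}
// Desired behavior: with IF EXISTS, no exception even when the partition spec matches nothing.
spark.sql("ALTER TABLE mydb.events DROP IF EXISTS PARTITION (dt='2017-01-01')")
{code}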







[jira] [Created] (SPARK-19129) alter table table_name drop partition with a empty string will drop the whole table

2017-01-08 Thread lichenglin (JIRA)
lichenglin created SPARK-19129:
--

 Summary: alter table table_name drop partition with a empty string 
will drop the whole table
 Key: SPARK-19129
 URL: https://issues.apache.org/jira/browse/SPARK-19129
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.1.0
Reporter: lichenglin


{code}
val spark = SparkSession
  .builder
  .appName("PartitionDropTest")
  .master("local[2]").enableHiveSupport()
  .getOrCreate()
val sentenceData = spark.createDataFrame(Seq(
  (0, "a"),
  (1, "b"),
  (2, "c")))
  .toDF("id", "name")
spark.sql("drop table if exists licllocal.partition_table")

sentenceData.write.mode(SaveMode.Overwrite).partitionBy("id").saveAsTable("licllocal.partition_table")
spark.sql("alter table licllocal.partition_table drop partition(id='')")
spark.table("licllocal.partition_table").show()
{code}
The result is:
{code}
+----+---+
|name| id|
+----+---+
+----+---+
{code}

Maybe the partition matching goes wrong when the partition value is set 
to an empty string.
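A quick way to see what the drop actually removed (a hypothetical check, reusing the table from the repro above):

{code}
// List the partitions before and after the drop; with the reported bug, the second
// listing would presumably come back empty.
spark.sql("show partitions licllocal.partition_table").show()
spark.sql("alter table licllocal.partition_table drop partition(id='')")
spark.sql("show partitions licllocal.partition_table").show()
{code}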






[jira] [Created] (SPARK-19075) Plz make MinMaxScaler can work with a Number type field

2017-01-03 Thread lichenglin (JIRA)
lichenglin created SPARK-19075:
--

 Summary: Plz make MinMaxScaler can work with a Number type field
 Key: SPARK-19075
 URL: https://issues.apache.org/jira/browse/SPARK-19075
 Project: Spark
  Issue Type: Wish
  Components: ML
Affects Versions: 2.0.2
Reporter: lichenglin


Until now, MinMaxScaler can only work with Vector-typed fields.

Please make it work with double and other numeric types.

The same goes for MaxAbsScaler.
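Until then, one possible workaround is to wrap the numeric column into a Vector with VectorAssembler before scaling; a minimal sketch, assuming a SparkSession named spark and a double column named "value":

{code}
import org.apache.spark.ml.feature.{MinMaxScaler, VectorAssembler}

// Hypothetical workaround: assemble the single numeric column into a Vector column,
// then apply MinMaxScaler to that Vector column.
val df = spark.range(5).selectExpr("cast(id as double) as value")
val assembled = new VectorAssembler()
  .setInputCols(Array("value"))
  .setOutputCol("valueVec")
  .transform(df)
val scaler = new MinMaxScaler()
  .setInputCol("valueVec")
  .setOutputCol("valueScaled")
scaler.fit(assembled).transform(assembled).show()
{code}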






[jira] [Commented] (SPARK-18893) Not support "alter table .. add columns .."

2016-12-15 Thread lichenglin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15753304#comment-15753304
 ] 

lichenglin commented on SPARK-18893:


Spark 2.0 has disabled "alter table".

[https://issues.apache.org/jira/browse/SPARK-14118]

[https://issues.apache.org/jira/browse/SPARK-14130]

I think this is a very important feature for data warehouses.

Can Spark address it first?
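For reference, a minimal sketch of the kind of statement being requested (the table and column are hypothetical); per the reports above, Spark 2.0.x rejects it:

{code}
// Standard Hive DDL for adding a column; reported as unsupported in Spark 2.0.x.
spark.sql("ALTER TABLE mydb.events ADD COLUMNS (new_col DOUBLE)")
{code}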

> Not support "alter table .. add columns .." 
> 
>
> Key: SPARK-18893
> URL: https://issues.apache.org/jira/browse/SPARK-18893
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1
>Reporter: zuotingbing
>
> When we updated Spark from version 1.5.2 to 2.0.1, all of our cases that 
> change a table with "alter table add columns" failed, yet the official document 
> says "All Hive DDL Functions, including: alter table": 
> http://spark.apache.org/docs/latest/sql-programming-guide.html.
> Is there any plan to support the SQL "alter table .. add/replace columns"?






[jira] [Comment Edited] (SPARK-14130) [Table related commands] Alter column

2016-12-15 Thread lichenglin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15753146#comment-15753146
 ] 

lichenglin edited comment on SPARK-14130 at 12/16/16 2:00 AM:
--

"TOK_ALTERTABLE_ADDCOLS" is a very important command for data warehouse.

Does spark have any plan to support  for it??

Even though it only works on hivecontext with specially fileformat




was (Author: licl):
"TOK_ALTERTABLE_ADDCOLS" is a very important command for data warehouse.

Does spark have any plan to support  for it??



> [Table related commands] Alter column
> -
>
> Key: SPARK-14130
> URL: https://issues.apache.org/jira/browse/SPARK-14130
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Yin Huai
>Assignee: Yin Huai
> Fix For: 2.0.0
>
>
> For the alter column command, we have the following tokens:
> TOK_ALTERTABLE_RENAMECOL
> TOK_ALTERTABLE_ADDCOLS
> TOK_ALTERTABLE_REPLACECOLS
> For data source tables, we should throw exceptions. For Hive tables, we 
> should support them. *For Hive tables, we should check Hive's behavior to see 
> if there is any file format that does not support any of the above commands*. 
> https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/DDLTask.java
> is a good reference for Hive's behavior. 
> Also, for a Hive table stored in a given format, we need to make sure that 
> Spark can still read the table after an alter column operation. If we cannot read 
> the table, even if Hive allows the alter column operation, we should still throw 
> an exception. For example, if renaming a column of a Hive parquet table 
> makes the renamed column inaccessible (we cannot read its values), we should not 
> allow this renaming operation.






[jira] [Issue Comment Deleted] (SPARK-14130) [Table related commands] Alter column

2016-12-15 Thread lichenglin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14130?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lichenglin updated SPARK-14130:
---
Comment: was deleted

(was: "TOK_ALTERTABLE_ADDCOLS" is a very important command for data warehouse.

Does spark have any plan to support  for it??

)

> [Table related commands] Alter column
> -
>
> Key: SPARK-14130
> URL: https://issues.apache.org/jira/browse/SPARK-14130
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Yin Huai
>Assignee: Yin Huai
> Fix For: 2.0.0
>
>
> For the alter column command, we have the following tokens:
> TOK_ALTERTABLE_RENAMECOL
> TOK_ALTERTABLE_ADDCOLS
> TOK_ALTERTABLE_REPLACECOLS
> For data source tables, we should throw exceptions. For Hive tables, we 
> should support them. *For Hive tables, we should check Hive's behavior to see 
> if there is any file format that does not support any of the above commands*. 
> https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/DDLTask.java
> is a good reference for Hive's behavior. 
> Also, for a Hive table stored in a given format, we need to make sure that 
> Spark can still read the table after an alter column operation. If we cannot read 
> the table, even if Hive allows the alter column operation, we should still throw 
> an exception. For example, if renaming a column of a Hive parquet table 
> makes the renamed column inaccessible (we cannot read its values), we should not 
> allow this renaming operation.






[jira] [Issue Comment Deleted] (SPARK-14130) [Table related commands] Alter column

2016-12-15 Thread lichenglin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14130?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lichenglin updated SPARK-14130:
---
Comment: was deleted

(was: "TOK_ALTERTABLE_ADDCOLS" is a very important command for data warehouse.

Does spark have any plan to support  for it??

)

> [Table related commands] Alter column
> -
>
> Key: SPARK-14130
> URL: https://issues.apache.org/jira/browse/SPARK-14130
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Yin Huai
>Assignee: Yin Huai
> Fix For: 2.0.0
>
>
> For the alter column command, we have the following tokens:
> TOK_ALTERTABLE_RENAMECOL
> TOK_ALTERTABLE_ADDCOLS
> TOK_ALTERTABLE_REPLACECOLS
> For data source tables, we should throw exceptions. For Hive tables, we 
> should support them. *For Hive tables, we should check Hive's behavior to see 
> if there is any file format that does not support any of the above commands*. 
> https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/DDLTask.java
> is a good reference for Hive's behavior. 
> Also, for a Hive table stored in a given format, we need to make sure that 
> Spark can still read the table after an alter column operation. If we cannot read 
> the table, even if Hive allows the alter column operation, we should still throw 
> an exception. For example, if renaming a column of a Hive parquet table 
> makes the renamed column inaccessible (we cannot read its values), we should not 
> allow this renaming operation.






[jira] [Commented] (SPARK-14130) [Table related commands] Alter column

2016-12-15 Thread lichenglin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15753147#comment-15753147
 ] 

lichenglin commented on SPARK-14130:


"TOK_ALTERTABLE_ADDCOLS" is a very important command for data warehouse.

Does spark have any plan to support  for it??



> [Table related commands] Alter column
> -
>
> Key: SPARK-14130
> URL: https://issues.apache.org/jira/browse/SPARK-14130
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Yin Huai
>Assignee: Yin Huai
> Fix For: 2.0.0
>
>
> For the alter column command, we have the following tokens:
> TOK_ALTERTABLE_RENAMECOL
> TOK_ALTERTABLE_ADDCOLS
> TOK_ALTERTABLE_REPLACECOLS
> For data source tables, we should throw exceptions. For Hive tables, we 
> should support them. *For Hive tables, we should check Hive's behavior to see 
> if there is any file format that does not support any of the above commands*. 
> https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/DDLTask.java
> is a good reference for Hive's behavior. 
> Also, for a Hive table stored in a given format, we need to make sure that 
> Spark can still read the table after an alter column operation. If we cannot read 
> the table, even if Hive allows the alter column operation, we should still throw 
> an exception. For example, if renaming a column of a Hive parquet table 
> makes the renamed column inaccessible (we cannot read its values), we should not 
> allow this renaming operation.






[jira] [Commented] (SPARK-14130) [Table related commands] Alter column

2016-12-15 Thread lichenglin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15753148#comment-15753148
 ] 

lichenglin commented on SPARK-14130:


"TOK_ALTERTABLE_ADDCOLS" is a very important command for data warehouse.

Does spark have any plan to support  for it??



> [Table related commands] Alter column
> -
>
> Key: SPARK-14130
> URL: https://issues.apache.org/jira/browse/SPARK-14130
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Yin Huai
>Assignee: Yin Huai
> Fix For: 2.0.0
>
>
> For the alter column command, we have the following tokens:
> TOK_ALTERTABLE_RENAMECOL
> TOK_ALTERTABLE_ADDCOLS
> TOK_ALTERTABLE_REPLACECOLS
> For data source tables, we should throw exceptions. For Hive tables, we 
> should support them. *For Hive tables, we should check Hive's behavior to see 
> if there is any file format that does not support any of the above commands*. 
> https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/DDLTask.java
> is a good reference for Hive's behavior. 
> Also, for a Hive table stored in a given format, we need to make sure that 
> Spark can still read the table after an alter column operation. If we cannot read 
> the table, even if Hive allows the alter column operation, we should still throw 
> an exception. For example, if renaming a column of a Hive parquet table 
> makes the renamed column inaccessible (we cannot read its values), we should not 
> allow this renaming operation.






[jira] [Commented] (SPARK-14130) [Table related commands] Alter column

2016-12-15 Thread lichenglin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15753146#comment-15753146
 ] 

lichenglin commented on SPARK-14130:


"TOK_ALTERTABLE_ADDCOLS" is a very important command for data warehouse.

Does spark have any plan to support  for it??



> [Table related commands] Alter column
> -
>
> Key: SPARK-14130
> URL: https://issues.apache.org/jira/browse/SPARK-14130
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Yin Huai
>Assignee: Yin Huai
> Fix For: 2.0.0
>
>
> For the alter column command, we have the following tokens:
> TOK_ALTERTABLE_RENAMECOL
> TOK_ALTERTABLE_ADDCOLS
> TOK_ALTERTABLE_REPLACECOLS
> For data source tables, we should throw exceptions. For Hive tables, we 
> should support them. *For Hive tables, we should check Hive's behavior to see 
> if there is any file format that does not support any of the above commands*. 
> https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/DDLTask.java
> is a good reference for Hive's behavior. 
> Also, for a Hive table stored in a given format, we need to make sure that 
> Spark can still read the table after an alter column operation. If we cannot read 
> the table, even if Hive allows the alter column operation, we should still throw 
> an exception. For example, if renaming a column of a Hive parquet table 
> makes the renamed column inaccessible (we cannot read its values), we should not 
> allow this renaming operation.






[jira] [Commented] (SPARK-18441) Add Smote in spark mlib and ml

2016-11-15 Thread lichenglin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18441?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15669125#comment-15669125
 ] 

lichenglin commented on SPARK-18441:


Thanks, it works now.

> Add Smote in spark mlib and ml
> --
>
> Key: SPARK-18441
> URL: https://issues.apache.org/jira/browse/SPARK-18441
> Project: Spark
>  Issue Type: Wish
>  Components: ML, MLlib
>Affects Versions: 2.0.1
>Reporter: lichenglin
>
> Please add SMOTE to Spark MLlib and ML to handle imbalanced training 
> data for classification.






[jira] [Commented] (SPARK-18441) Add Smote in spark mlib and ml

2016-11-15 Thread lichenglin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18441?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15669050#comment-15669050
 ] 

lichenglin commented on SPARK-18441:


Thanks for your reply.
May I ask which version of Spark this SMOTE implementation targets?
I'm using Spark 2.0.1,
but an error occurs because of "org.apache.spark.ml.linalg.BLAS":

my Eclipse can't recognize this class.

> Add Smote in spark mlib and ml
> --
>
> Key: SPARK-18441
> URL: https://issues.apache.org/jira/browse/SPARK-18441
> Project: Spark
>  Issue Type: Wish
>  Components: ML, MLlib
>Affects Versions: 2.0.1
>Reporter: lichenglin
>
> Please add SMOTE to Spark MLlib and ML to handle imbalanced training 
> data for classification.






[jira] [Created] (SPARK-18441) Add Smote in spark mlib and ml

2016-11-14 Thread lichenglin (JIRA)
lichenglin created SPARK-18441:
--

 Summary: Add Smote in spark mlib and ml
 Key: SPARK-18441
 URL: https://issues.apache.org/jira/browse/SPARK-18441
 Project: Spark
  Issue Type: Wish
  Components: ML, MLlib
Affects Versions: 2.0.1
Reporter: lichenglin


Please add SMOTE to Spark MLlib and ML to handle imbalanced training data 
for classification.






[jira] [Commented] (SPARK-18413) Add a property to control the number of partitions when save a jdbc rdd

2016-11-13 Thread lichenglin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15662759#comment-15662759
 ] 

lichenglin commented on SPARK-18413:


I'm sorry, my network is too bad to download the dependencies from the Maven 
repository for building Spark.
I have made a comment on your PR; please check whether it is right.
Thanks

> Add a property to control the number of partitions when save a jdbc rdd
> ---
>
> Key: SPARK-18413
> URL: https://issues.apache.org/jira/browse/SPARK-18413
> Project: Spark
>  Issue Type: Wish
>  Components: SQL
>Affects Versions: 2.0.1
>Reporter: lichenglin
>
> {code}
> CREATE or replace TEMPORARY VIEW resultview
> USING org.apache.spark.sql.jdbc
> OPTIONS (
>   url "jdbc:oracle:thin:@10.129.10.111:1521:BKDB",
>   dbtable "result",
>   user "HIVE",
>   password "HIVE"
> );
> --set spark.sql.shuffle.partitions=200
> insert overwrite table resultview select g,count(1) as count from 
> tnet.DT_LIVE_INFO group by g
> {code}
> I'm tring to save a spark sql result to oracle.
> And I found spark will create a jdbc connection for each partition.
> if the sql create to many partitions , the database can't hold so many 
> connections and return exception.
> In above situation is 200 because of the "group by" and 
> "spark.sql.shuffle.partitions"
> the spark source code JdbcUtil is
> {code}
> def saveTable(
>   df: DataFrame,
>   url: String,
>   table: String,
>   properties: Properties) {
> val dialect = JdbcDialects.get(url)
> val nullTypes: Array[Int] = df.schema.fields.map { field =>
>   getJdbcType(field.dataType, dialect).jdbcNullType
> }
> val rddSchema = df.schema
> val getConnection: () => Connection = createConnectionFactory(url, 
> properties)
> val batchSize = properties.getProperty(JDBC_BATCH_INSERT_SIZE, 
> "1000").toInt
> df.foreachPartition { iterator =>
>   savePartition(getConnection, table, iterator, rddSchema, nullTypes, 
> batchSize, dialect)
> }
>   }
> {code}
> May be we can add a property for df.repartition(num).foreachPartition ?
> In fact I got an exception "ORA-12519, TNS:no appropriate service handler 
> found"






[jira] [Commented] (SPARK-18413) Add a property to control the number of partitions when save a jdbc rdd

2016-11-11 Thread lichenglin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15658921#comment-15658921
 ] 

lichenglin commented on SPARK-18413:


Sorry, I can't.
I'm a rookie and have a really terrible network...


> Add a property to control the number of partitions when save a jdbc rdd
> ---
>
> Key: SPARK-18413
> URL: https://issues.apache.org/jira/browse/SPARK-18413
> Project: Spark
>  Issue Type: Wish
>  Components: SQL
>Affects Versions: 2.0.1
>Reporter: lichenglin
>
> {code}
> CREATE or replace TEMPORARY VIEW resultview
> USING org.apache.spark.sql.jdbc
> OPTIONS (
>   url "jdbc:oracle:thin:@10.129.10.111:1521:BKDB",
>   dbtable "result",
>   user "HIVE",
>   password "HIVE"
> );
> --set spark.sql.shuffle.partitions=200
> insert overwrite table resultview select g,count(1) as count from 
> tnet.DT_LIVE_INFO group by g
> {code}
> I'm tring to save a spark sql result to oracle.
> And I found spark will create a jdbc connection for each partition.
> if the sql create to many partitions , the database can't hold so many 
> connections and return exception.
> In above situation is 200 because of the "group by" and 
> "spark.sql.shuffle.partitions"
> the spark source code JdbcUtil is
> {code}
> def saveTable(
>   df: DataFrame,
>   url: String,
>   table: String,
>   properties: Properties) {
> val dialect = JdbcDialects.get(url)
> val nullTypes: Array[Int] = df.schema.fields.map { field =>
>   getJdbcType(field.dataType, dialect).jdbcNullType
> }
> val rddSchema = df.schema
> val getConnection: () => Connection = createConnectionFactory(url, 
> properties)
> val batchSize = properties.getProperty(JDBC_BATCH_INSERT_SIZE, 
> "1000").toInt
> df.foreachPartition { iterator =>
>   savePartition(getConnection, table, iterator, rddSchema, nullTypes, 
> batchSize, dialect)
> }
>   }
> {code}
> May be we can add a property for df.repartition(num).foreachPartition ?
> In fact I got an exception "ORA-12519, TNS:no appropriate service handler 
> found"






[jira] [Commented] (SPARK-18413) Add a property to control the number of partitions when save a jdbc rdd

2016-11-11 Thread lichenglin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15658920#comment-15658920
 ] 

lichenglin commented on SPARK-18413:


Sorry, I can't.
I'm a rookie and have a really terrible network...


> Add a property to control the number of partitions when save a jdbc rdd
> ---
>
> Key: SPARK-18413
> URL: https://issues.apache.org/jira/browse/SPARK-18413
> Project: Spark
>  Issue Type: Wish
>  Components: SQL
>Affects Versions: 2.0.1
>Reporter: lichenglin
>
> {code}
> CREATE or replace TEMPORARY VIEW resultview
> USING org.apache.spark.sql.jdbc
> OPTIONS (
>   url "jdbc:oracle:thin:@10.129.10.111:1521:BKDB",
>   dbtable "result",
>   user "HIVE",
>   password "HIVE"
> );
> --set spark.sql.shuffle.partitions=200
> insert overwrite table resultview select g,count(1) as count from 
> tnet.DT_LIVE_INFO group by g
> {code}
> I'm tring to save a spark sql result to oracle.
> And I found spark will create a jdbc connection for each partition.
> if the sql create to many partitions , the database can't hold so many 
> connections and return exception.
> In above situation is 200 because of the "group by" and 
> "spark.sql.shuffle.partitions"
> the spark source code JdbcUtil is
> {code}
> def saveTable(
>   df: DataFrame,
>   url: String,
>   table: String,
>   properties: Properties) {
> val dialect = JdbcDialects.get(url)
> val nullTypes: Array[Int] = df.schema.fields.map { field =>
>   getJdbcType(field.dataType, dialect).jdbcNullType
> }
> val rddSchema = df.schema
> val getConnection: () => Connection = createConnectionFactory(url, 
> properties)
> val batchSize = properties.getProperty(JDBC_BATCH_INSERT_SIZE, 
> "1000").toInt
> df.foreachPartition { iterator =>
>   savePartition(getConnection, table, iterator, rddSchema, nullTypes, 
> batchSize, dialect)
> }
>   }
> {code}
> May be we can add a property for df.repartition(num).foreachPartition ?
> In fact I got an exception "ORA-12519, TNS:no appropriate service handler 
> found"






[jira] [Issue Comment Deleted] (SPARK-18413) Add a property to control the number of partitions when save a jdbc rdd

2016-11-11 Thread lichenglin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18413?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lichenglin updated SPARK-18413:
---
Comment: was deleted

(was: Sorry,I can't.
I'm a rookie and have a really terrible network...
)

> Add a property to control the number of partitions when save a jdbc rdd
> ---
>
> Key: SPARK-18413
> URL: https://issues.apache.org/jira/browse/SPARK-18413
> Project: Spark
>  Issue Type: Wish
>  Components: SQL
>Affects Versions: 2.0.1
>Reporter: lichenglin
>
> {code}
> CREATE or replace TEMPORARY VIEW resultview
> USING org.apache.spark.sql.jdbc
> OPTIONS (
>   url "jdbc:oracle:thin:@10.129.10.111:1521:BKDB",
>   dbtable "result",
>   user "HIVE",
>   password "HIVE"
> );
> --set spark.sql.shuffle.partitions=200
> insert overwrite table resultview select g,count(1) as count from 
> tnet.DT_LIVE_INFO group by g
> {code}
> I'm tring to save a spark sql result to oracle.
> And I found spark will create a jdbc connection for each partition.
> if the sql create to many partitions , the database can't hold so many 
> connections and return exception.
> In above situation is 200 because of the "group by" and 
> "spark.sql.shuffle.partitions"
> the spark source code JdbcUtil is
> {code}
> def saveTable(
>   df: DataFrame,
>   url: String,
>   table: String,
>   properties: Properties) {
> val dialect = JdbcDialects.get(url)
> val nullTypes: Array[Int] = df.schema.fields.map { field =>
>   getJdbcType(field.dataType, dialect).jdbcNullType
> }
> val rddSchema = df.schema
> val getConnection: () => Connection = createConnectionFactory(url, 
> properties)
> val batchSize = properties.getProperty(JDBC_BATCH_INSERT_SIZE, 
> "1000").toInt
> df.foreachPartition { iterator =>
>   savePartition(getConnection, table, iterator, rddSchema, nullTypes, 
> batchSize, dialect)
> }
>   }
> {code}
> May be we can add a property for df.repartition(num).foreachPartition ?
> In fact I got an exception "ORA-12519, TNS:no appropriate service handler 
> found"






[jira] [Comment Edited] (SPARK-18413) Add a property to control the number of partitions when save a jdbc rdd

2016-11-11 Thread lichenglin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15658562#comment-15658562
 ] 

lichenglin edited comment on SPARK-18413 at 11/11/16 11:57 PM:
---

{code}
CREATE or replace TEMPORARY VIEW resultview
USING org.apache.spark.sql.jdbc
OPTIONS (
  url "jdbc:oracle:thin:@10.129.10.111:1521:BKDB",
  dbtable "result",
  user "HIVE",
  password "HIVE"
  numPartitions "10"
);
{code}
May be we can use numPartitions.
until now numPartitions is just for reading


was (Author: licl):
{code}
CREATE or replace TEMPORARY VIEW resultview
USING org.apache.spark.sql.jdbc
OPTIONS (
  url "jdbc:oracle:thin:@10.129.10.111:1521:BKDB",
  dbtable "result",
  user "HIVE",
  password "HIVE"
  numPartitions 10
);
{code}
May be we can use numPartitions.
util now numPartitions is just for reading

> Add a property to control the number of partitions when save a jdbc rdd
> ---
>
> Key: SPARK-18413
> URL: https://issues.apache.org/jira/browse/SPARK-18413
> Project: Spark
>  Issue Type: Wish
>  Components: SQL
>Affects Versions: 2.0.1
>Reporter: lichenglin
>
> {code}
> CREATE or replace TEMPORARY VIEW resultview
> USING org.apache.spark.sql.jdbc
> OPTIONS (
>   url "jdbc:oracle:thin:@10.129.10.111:1521:BKDB",
>   dbtable "result",
>   user "HIVE",
>   password "HIVE"
> );
> --set spark.sql.shuffle.partitions=200
> insert overwrite table resultview select g,count(1) as count from 
> tnet.DT_LIVE_INFO group by g
> {code}
> I'm tring to save a spark sql result to oracle.
> And I found spark will create a jdbc connection for each partition.
> if the sql create to many partitions , the database can't hold so many 
> connections and return exception.
> In above situation is 200 because of the "group by" and 
> "spark.sql.shuffle.partitions"
> the spark source code JdbcUtil is
> {code}
> def saveTable(
>   df: DataFrame,
>   url: String,
>   table: String,
>   properties: Properties) {
> val dialect = JdbcDialects.get(url)
> val nullTypes: Array[Int] = df.schema.fields.map { field =>
>   getJdbcType(field.dataType, dialect).jdbcNullType
> }
> val rddSchema = df.schema
> val getConnection: () => Connection = createConnectionFactory(url, 
> properties)
> val batchSize = properties.getProperty(JDBC_BATCH_INSERT_SIZE, 
> "1000").toInt
> df.foreachPartition { iterator =>
>   savePartition(getConnection, table, iterator, rddSchema, nullTypes, 
> batchSize, dialect)
> }
>   }
> {code}
> May be we can add a property for df.repartition(num).foreachPartition ?
> In fact I got an exception "ORA-12519, TNS:no appropriate service handler 
> found"






[jira] [Commented] (SPARK-18413) Add a property to control the number of partitions when save a jdbc rdd

2016-11-11 Thread lichenglin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15658562#comment-15658562
 ] 

lichenglin commented on SPARK-18413:


{code}
CREATE or replace TEMPORARY VIEW resultview
USING org.apache.spark.sql.jdbc
OPTIONS (
  url "jdbc:oracle:thin:@10.129.10.111:1521:BKDB",
  dbtable "result",
  user "HIVE",
  password "HIVE"
  numPartitions 10
);
{code}
Maybe we can use numPartitions.
Until now, numPartitions has only been used for reading.

> Add a property to control the number of partitions when save a jdbc rdd
> ---
>
> Key: SPARK-18413
> URL: https://issues.apache.org/jira/browse/SPARK-18413
> Project: Spark
>  Issue Type: Wish
>  Components: SQL
>Affects Versions: 2.0.1
>Reporter: lichenglin
>
> {code}
> CREATE or replace TEMPORARY VIEW resultview
> USING org.apache.spark.sql.jdbc
> OPTIONS (
>   url "jdbc:oracle:thin:@10.129.10.111:1521:BKDB",
>   dbtable "result",
>   user "HIVE",
>   password "HIVE"
> );
> --set spark.sql.shuffle.partitions=200
> insert overwrite table resultview select g,count(1) as count from 
> tnet.DT_LIVE_INFO group by g
> {code}
> I'm tring to save a spark sql result to oracle.
> And I found spark will create a jdbc connection for each partition.
> if the sql create to many partitions , the database can't hold so many 
> connections and return exception.
> In above situation is 200 because of the "group by" and 
> "spark.sql.shuffle.partitions"
> the spark source code JdbcUtil is
> {code}
> def saveTable(
>   df: DataFrame,
>   url: String,
>   table: String,
>   properties: Properties) {
> val dialect = JdbcDialects.get(url)
> val nullTypes: Array[Int] = df.schema.fields.map { field =>
>   getJdbcType(field.dataType, dialect).jdbcNullType
> }
> val rddSchema = df.schema
> val getConnection: () => Connection = createConnectionFactory(url, 
> properties)
> val batchSize = properties.getProperty(JDBC_BATCH_INSERT_SIZE, 
> "1000").toInt
> df.foreachPartition { iterator =>
>   savePartition(getConnection, table, iterator, rddSchema, nullTypes, 
> batchSize, dialect)
> }
>   }
> {code}
> May be we can add a property for df.repartition(num).foreachPartition ?
> In fact I got an exception "ORA-12519, TNS:no appropriate service handler 
> found"






[jira] [Commented] (SPARK-18413) Add a property to control the number of partitions when save a jdbc rdd

2016-11-11 Thread lichenglin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15658490#comment-15658490
 ] 

lichenglin commented on SPARK-18413:


I'm using Spark SQL, and how do I call repartition with SQL?

> Add a property to control the number of partitions when save a jdbc rdd
> ---
>
> Key: SPARK-18413
> URL: https://issues.apache.org/jira/browse/SPARK-18413
> Project: Spark
>  Issue Type: Wish
>  Components: SQL
>Affects Versions: 2.0.1
>Reporter: lichenglin
>
> {code}
> CREATE or replace TEMPORARY VIEW resultview
> USING org.apache.spark.sql.jdbc
> OPTIONS (
>   url "jdbc:oracle:thin:@10.129.10.111:1521:BKDB",
>   dbtable "result",
>   user "HIVE",
>   password "HIVE"
> );
> --set spark.sql.shuffle.partitions=200
> insert overwrite table resultview select g,count(1) as count from 
> tnet.DT_LIVE_INFO group by g
> {code}
> I'm tring to save a spark sql result to oracle.
> And I found spark will create a jdbc connection for each partition.
> if the sql create to many partitions , the database can't hold so many 
> connections and return exception.
> In above situation is 200 because of the "group by" and 
> "spark.sql.shuffle.partitions"
> the spark source code JdbcUtil is
> {code}
> def saveTable(
>   df: DataFrame,
>   url: String,
>   table: String,
>   properties: Properties) {
> val dialect = JdbcDialects.get(url)
> val nullTypes: Array[Int] = df.schema.fields.map { field =>
>   getJdbcType(field.dataType, dialect).jdbcNullType
> }
> val rddSchema = df.schema
> val getConnection: () => Connection = createConnectionFactory(url, 
> properties)
> val batchSize = properties.getProperty(JDBC_BATCH_INSERT_SIZE, 
> "1000").toInt
> df.foreachPartition { iterator =>
>   savePartition(getConnection, table, iterator, rddSchema, nullTypes, 
> batchSize, dialect)
> }
>   }
> {code}
> May be we can add a property for df.repartition(num).foreachPartition ?
> In fact I got an exception "ORA-12519, TNS:no appropriate service handler 
> found"






[jira] [Reopened] (SPARK-18413) Add a property to control the number of partitions when save a jdbc rdd

2016-11-11 Thread lichenglin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18413?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lichenglin reopened SPARK-18413:


> Add a property to control the number of partitions when save a jdbc rdd
> ---
>
> Key: SPARK-18413
> URL: https://issues.apache.org/jira/browse/SPARK-18413
> Project: Spark
>  Issue Type: Wish
>  Components: SQL
>Affects Versions: 2.0.1
>Reporter: lichenglin
>
> {code}
> CREATE or replace TEMPORARY VIEW resultview
> USING org.apache.spark.sql.jdbc
> OPTIONS (
>   url "jdbc:oracle:thin:@10.129.10.111:1521:BKDB",
>   dbtable "result",
>   user "HIVE",
>   password "HIVE"
> );
> --set spark.sql.shuffle.partitions=200
> insert overwrite table resultview select g,count(1) as count from 
> tnet.DT_LIVE_INFO group by g
> {code}
> I'm tring to save a spark sql result to oracle.
> And I found spark will create a jdbc connection for each partition.
> if the sql create to many partitions , the database can't hold so many 
> connections and return exception.
> In above situation is 200 because of the "group by" and 
> "spark.sql.shuffle.partitions"
> the spark source code JdbcUtil is
> {code}
> def saveTable(
>   df: DataFrame,
>   url: String,
>   table: String,
>   properties: Properties) {
> val dialect = JdbcDialects.get(url)
> val nullTypes: Array[Int] = df.schema.fields.map { field =>
>   getJdbcType(field.dataType, dialect).jdbcNullType
> }
> val rddSchema = df.schema
> val getConnection: () => Connection = createConnectionFactory(url, 
> properties)
> val batchSize = properties.getProperty(JDBC_BATCH_INSERT_SIZE, 
> "1000").toInt
> df.foreachPartition { iterator =>
>   savePartition(getConnection, table, iterator, rddSchema, nullTypes, 
> batchSize, dialect)
> }
>   }
> {code}
> May be we can add a property for df.repartition(num).foreachPartition ?
> In fact I got an exception "ORA-12519, TNS:no appropriate service handler 
> found"






[jira] [Updated] (SPARK-18413) Add a property to control the number of partitions when save a jdbc rdd

2016-11-11 Thread lichenglin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18413?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lichenglin updated SPARK-18413:
---
Description: 
{code}
CREATE or replace TEMPORARY VIEW resultview
USING org.apache.spark.sql.jdbc
OPTIONS (
  url "jdbc:oracle:thin:@10.129.10.111:1521:BKDB",
  dbtable "result",
  user "HIVE",
  password "HIVE"
);
--set spark.sql.shuffle.partitions=200
insert overwrite table resultview select g,count(1) as count from 
tnet.DT_LIVE_INFO group by g
{code}

I'm trying to save a Spark SQL result to Oracle,
and I found that Spark will create a JDBC connection for each partition.
If the SQL creates too many partitions, the database can't hold so many 
connections and returns an exception.

In the above situation it is 200 because of the "group by" and 
"spark.sql.shuffle.partitions".

the spark source code JdbcUtil is
{code}
def saveTable(
  df: DataFrame,
  url: String,
  table: String,
  properties: Properties) {
val dialect = JdbcDialects.get(url)
val nullTypes: Array[Int] = df.schema.fields.map { field =>
  getJdbcType(field.dataType, dialect).jdbcNullType
}

val rddSchema = df.schema
val getConnection: () => Connection = createConnectionFactory(url, 
properties)
val batchSize = properties.getProperty(JDBC_BATCH_INSERT_SIZE, "1000").toInt
df.foreachPartition { iterator =>
  savePartition(getConnection, table, iterator, rddSchema, nullTypes, 
batchSize, dialect)
}
  }
{code}

Maybe we can add a property for df.repartition(num).foreachPartition?

In fact I got the exception "ORA-12519, TNS:no appropriate service handler found".

  was:
{code}
CREATE or replace TEMPORARY VIEW resultview
USING org.apache.spark.sql.jdbc
OPTIONS (
  url "jdbc:oracle:thin:@10.129.10.111:1521:BKDB",
  dbtable "result",
  user "HIVE",
  password "HIVE"
);
--set spark.sql.shuffle.partitions=200
insert overwrite table resultview select g,count(1) as count from 
tnet.DT_LIVE_INFO group by g
{code}

I'm tring to save a spark sql result to oracle.
And I found spark will create a jdbc connection for each partition.
if the sql create to many partitions , the database can't hold so many 
connections and return exception.

In above situation is 200 because of the "group by" and 
"spark.sql.shuffle.partitions"

the spark source code JdbcUtil is
{code}
def saveTable(
  df: DataFrame,
  url: String,
  table: String,
  properties: Properties) {
val dialect = JdbcDialects.get(url)
val nullTypes: Array[Int] = df.schema.fields.map { field =>
  getJdbcType(field.dataType, dialect).jdbcNullType
}

val rddSchema = df.schema
val getConnection: () => Connection = createConnectionFactory(url, 
properties)
val batchSize = properties.getProperty(JDBC_BATCH_INSERT_SIZE, "1000").toInt
df.foreachPartition { iterator =>
  savePartition(getConnection, table, iterator, rddSchema, nullTypes, 
batchSize, dialect)
}
  }
{code}

May be we can add a property for df.repartition(num).foreachPartition ?



> Add a property to control the number of partitions when save a jdbc rdd
> ---
>
> Key: SPARK-18413
> URL: https://issues.apache.org/jira/browse/SPARK-18413
> Project: Spark
>  Issue Type: Wish
>  Components: SQL
>Affects Versions: 2.0.1
>Reporter: lichenglin
>
> {code}
> CREATE or replace TEMPORARY VIEW resultview
> USING org.apache.spark.sql.jdbc
> OPTIONS (
>   url "jdbc:oracle:thin:@10.129.10.111:1521:BKDB",
>   dbtable "result",
>   user "HIVE",
>   password "HIVE"
> );
> --set spark.sql.shuffle.partitions=200
> insert overwrite table resultview select g,count(1) as count from 
> tnet.DT_LIVE_INFO group by g
> {code}
> I'm tring to save a spark sql result to oracle.
> And I found spark will create a jdbc connection for each partition.
> if the sql create to many partitions , the database can't hold so many 
> connections and return exception.
> In above situation is 200 because of the "group by" and 
> "spark.sql.shuffle.partitions"
> the spark source code JdbcUtil is
> {code}
> def saveTable(
>   df: DataFrame,
>   url: String,
>   table: String,
>   properties: Properties) {
> val dialect = JdbcDialects.get(url)
> val nullTypes: Array[Int] = df.schema.fields.map { field =>
>   getJdbcType(field.dataType, dialect).jdbcNullType
> }
> val rddSchema = df.schema
> val getConnection: () => Connection = createConnectionFactory(url, 
> properties)
> val batchSize = properties.getProperty(JDBC_BATCH_INSERT_SIZE, 
> "1000").toInt
> df.foreachPartition { iterator =>
>   savePartition(getConnection, table, iterator, rddSchema, nullTypes, 
> batchSize, dialect)
> }
>   }
> {code}
> May be we can add a property for df.repartition(num).foreachPartition ?
> In fact I got an exception "ORA-12519, 

[jira] [Closed] (SPARK-18413) Add a property to control the number of partitions when save a jdbc rdd

2016-11-11 Thread lichenglin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18413?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lichenglin closed SPARK-18413.
--
Resolution: Invalid

> Add a property to control the number of partitions when save a jdbc rdd
> ---
>
> Key: SPARK-18413
> URL: https://issues.apache.org/jira/browse/SPARK-18413
> Project: Spark
>  Issue Type: Wish
>  Components: SQL
>Affects Versions: 2.0.1
>Reporter: lichenglin
>
> {code}
> CREATE or replace TEMPORARY VIEW resultview
> USING org.apache.spark.sql.jdbc
> OPTIONS (
>   url "jdbc:oracle:thin:@10.129.10.111:1521:BKDB",
>   dbtable "result",
>   user "HIVE",
>   password "HIVE"
> );
> --set spark.sql.shuffle.partitions=200
> insert overwrite table resultview select g,count(1) as count from 
> tnet.DT_LIVE_INFO group by g
> {code}
> I'm tring to save a spark sql result to oracle.
> And I found spark will create a jdbc connection for each partition.
> if the sql create to many partitions , the database can't hold so many 
> connections and return exception.
> In above situation is 200 because of the "group by" and 
> "spark.sql.shuffle.partitions"
> the spark source code JdbcUtil is
> {code}
> def saveTable(
>   df: DataFrame,
>   url: String,
>   table: String,
>   properties: Properties) {
> val dialect = JdbcDialects.get(url)
> val nullTypes: Array[Int] = df.schema.fields.map { field =>
>   getJdbcType(field.dataType, dialect).jdbcNullType
> }
> val rddSchema = df.schema
> val getConnection: () => Connection = createConnectionFactory(url, 
> properties)
> val batchSize = properties.getProperty(JDBC_BATCH_INSERT_SIZE, 
> "1000").toInt
> df.foreachPartition { iterator =>
>   savePartition(getConnection, table, iterator, rddSchema, nullTypes, 
> batchSize, dialect)
> }
>   }
> {code}
> May be we can add a property for df.repartition(num).foreachPartition ?






[jira] [Created] (SPARK-18413) Add a property to control the number of partitions when save a jdbc rdd

2016-11-11 Thread lichenglin (JIRA)
lichenglin created SPARK-18413:
--

 Summary: Add a property to control the number of partitions when 
save a jdbc rdd
 Key: SPARK-18413
 URL: https://issues.apache.org/jira/browse/SPARK-18413
 Project: Spark
  Issue Type: Wish
  Components: SQL
Affects Versions: 2.0.1
Reporter: lichenglin


{code}
CREATE or replace TEMPORARY VIEW resultview
USING org.apache.spark.sql.jdbc
OPTIONS (
  url "jdbc:oracle:thin:@10.129.10.111:1521:BKDB",
  dbtable "result",
  user "HIVE",
  password "HIVE"
);
--set spark.sql.shuffle.partitions=200
insert overwrite table resultview select g,count(1) as count from 
tnet.DT_LIVE_INFO group by g
{code}

I'm trying to save a Spark SQL result to Oracle,
and I found that Spark will create a JDBC connection for each partition.
If the SQL creates too many partitions, the database can't hold so many 
connections and returns an exception.

In the above situation it is 200 because of the "group by" and 
"spark.sql.shuffle.partitions".

the spark source code JdbcUtil is
{code}
def saveTable(
  df: DataFrame,
  url: String,
  table: String,
  properties: Properties) {
val dialect = JdbcDialects.get(url)
val nullTypes: Array[Int] = df.schema.fields.map { field =>
  getJdbcType(field.dataType, dialect).jdbcNullType
}

val rddSchema = df.schema
val getConnection: () => Connection = createConnectionFactory(url, 
properties)
val batchSize = properties.getProperty(JDBC_BATCH_INSERT_SIZE, "1000").toInt
df.foreachPartition { iterator =>
  savePartition(getConnection, table, iterator, rddSchema, nullTypes, 
batchSize, dialect)
}
  }
{code}

Maybe we can add a property for df.repartition(num).foreachPartition?
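Until such a property exists, one possible application-side workaround is to cap the number of partitions before handing the data to the JDBC writer; a sketch, assuming a DataFrame named df that holds the aggregation result (it bypasses the SQL insert path and writes through the DataFrame API, reusing the connection settings from the example above):

{code}
import java.util.Properties

// Hypothetical workaround: coalesce to a small number of partitions so that
// only that many JDBC connections are opened while writing.
val props = new Properties()
props.setProperty("user", "HIVE")
props.setProperty("password", "HIVE")
df.coalesce(10)
  .write
  .mode(org.apache.spark.sql.SaveMode.Overwrite)
  .jdbc("jdbc:oracle:thin:@10.129.10.111:1521:BKDB", "result", props)
{code}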







[jira] [Commented] (SPARK-17898) --repositories needs username and password

2016-10-16 Thread lichenglin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17898?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15580857#comment-15580857
 ] 

lichenglin commented on SPARK-17898:


I have found a way to declare the username and password:

--repositories 
http://username:passw...@wx.bjdv.com:8081/nexus/content/groups/bigdata/

Maybe we should add this feature to the documentation?



> --repositories  needs username and password
> ---
>
> Key: SPARK-17898
> URL: https://issues.apache.org/jira/browse/SPARK-17898
> Project: Spark
>  Issue Type: Wish
>Affects Versions: 2.0.1
>Reporter: lichenglin
>
> My private repositories need username and password to visit.
> I can't find a way to declaration  the username and password when submit 
> spark application
> {code}
> bin/spark-submit --repositories   
> http://wx.bjdv.com:8081/nexus/content/groups/bigdata/ --packages 
> com.databricks:spark-csv_2.10:1.2.0   --class 
> org.apache.spark.examples.SparkPi   --master local[8]   
> examples/jars/spark-examples_2.11-2.0.1.jar   100
> {code}
> The rep http://wx.bjdv.com:8081/nexus/content/groups/bigdata/ need username 
> and password






[jira] [Comment Edited] (SPARK-17898) --repositories needs username and password

2016-10-13 Thread lichenglin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17898?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15573929#comment-15573929
 ] 

lichenglin edited comment on SPARK-17898 at 10/14/16 2:41 AM:
--

I know that.

But how do I build these dependencies into my jar?

Could you give me an example with Gradle?

Thanks a lot.

I have tried making a jar with mysql-jdbc.jar included.

And I have changed the manifest's classpath.

It failed.


was (Author: licl):
I know it.

But how to build these dependencies into my jar.

Could you give me an example with gradle??

Really thanks a lot.

> --repositories  needs username and password
> ---
>
> Key: SPARK-17898
> URL: https://issues.apache.org/jira/browse/SPARK-17898
> Project: Spark
>  Issue Type: Wish
>Affects Versions: 2.0.1
>Reporter: lichenglin
>
> My private repositories need username and password to visit.
> I can't find a way to declaration  the username and password when submit 
> spark application
> {code}
> bin/spark-submit --repositories   
> http://wx.bjdv.com:8081/nexus/content/groups/bigdata/ --packages 
> com.databricks:spark-csv_2.10:1.2.0   --class 
> org.apache.spark.examples.SparkPi   --master local[8]   
> examples/jars/spark-examples_2.11-2.0.1.jar   100
> {code}
> The rep http://wx.bjdv.com:8081/nexus/content/groups/bigdata/ need username 
> and password






[jira] [Commented] (SPARK-17898) --repositories needs username and password

2016-10-13 Thread lichenglin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17898?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15573929#comment-15573929
 ] 

lichenglin commented on SPARK-17898:


I know that.

But how do I build these dependencies into my jar?

Could you give me an example with Gradle?

Thanks a lot.

> --repositories  needs username and password
> ---
>
> Key: SPARK-17898
> URL: https://issues.apache.org/jira/browse/SPARK-17898
> Project: Spark
>  Issue Type: Wish
>Affects Versions: 2.0.1
>Reporter: lichenglin
>
> My private repositories need username and password to visit.
> I can't find a way to declaration  the username and password when submit 
> spark application
> {code}
> bin/spark-submit --repositories   
> http://wx.bjdv.com:8081/nexus/content/groups/bigdata/ --packages 
> com.databricks:spark-csv_2.10:1.2.0   --class 
> org.apache.spark.examples.SparkPi   --master local[8]   
> examples/jars/spark-examples_2.11-2.0.1.jar   100
> {code}
> The rep http://wx.bjdv.com:8081/nexus/content/groups/bigdata/ need username 
> and password






[jira] [Created] (SPARK-17898) --repositories needs username and password

2016-10-12 Thread lichenglin (JIRA)
lichenglin created SPARK-17898:
--

 Summary: --repositories  needs username and password
 Key: SPARK-17898
 URL: https://issues.apache.org/jira/browse/SPARK-17898
 Project: Spark
  Issue Type: Wish
Affects Versions: 2.0.1
Reporter: lichenglin


My private repository needs a username and password to access it.

I can't find a way to declare the username and password when submitting a Spark 
application:
{code}
bin/spark-submit --repositories   
http://wx.bjdv.com:8081/nexus/content/groups/bigdata/ --packages 
com.databricks:spark-csv_2.10:1.2.0   --class org.apache.spark.examples.SparkPi 
  --master local[8]   examples/jars/spark-examples_2.11-2.0.1.jar   100
{code}

The repository http://wx.bjdv.com:8081/nexus/content/groups/bigdata/ needs a username and 
password.






[jira] [Updated] (SPARK-16517) can't add columns on the table create by spark'writer

2016-07-13 Thread lichenglin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lichenglin updated SPARK-16517:
---
Summary: can't add columns on the table  create by spark'writer  (was: 
can't add columns on the parquet table )

> can't add columns on the table  create by spark'writer
> --
>
> Key: SPARK-16517
> URL: https://issues.apache.org/jira/browse/SPARK-16517
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.2
>Reporter: lichenglin
>
> {code}
> setName("abc");
> HiveContext hive = getHiveContext();
> DataFrame d = hive.createDataFrame(
>   getJavaSparkContext().parallelize(
>   
> Arrays.asList(RowFactory.create("abc", "abc", 5.0), RowFactory.create("abcd", 
> "abcd", 5.0))),
>   DataTypes.createStructType(
>   
> Arrays.asList(DataTypes.createStructField("card_id", DataTypes.StringType, 
> true),
>   
> DataTypes.createStructField("tag_name", DataTypes.StringType, true),
>   
> DataTypes.createStructField("v", DataTypes.DoubleType, true;
> d.write().partitionBy("v").mode(SaveMode.Overwrite).saveAsTable("abc");
> hive.sql("alter table abc add columns(v2 double)");
> hive.refreshTable("abc");
> hive.sql("describe abc").show();
> DataFrame d2 = hive.createDataFrame(
>   
> getJavaSparkContext().parallelize(Arrays.asList(RowFactory.create("abc", 
> "abc", 3.0, 4.0),
>   RowFactory.create("abcd", 
> "abcd", 3.0, 1.0))),
>   new StructType(new StructField[] { 
> DataTypes.createStructField("card_id", DataTypes.StringType, true),
>   
> DataTypes.createStructField("tag_name", DataTypes.StringType, true),
>   
> DataTypes.createStructField("v", DataTypes.DoubleType, true),
>   
> DataTypes.createStructField("v2", DataTypes.DoubleType, true) }));
> d2.write().partitionBy("v").mode(SaveMode.Append).saveAsTable("abc");
> hive.table("abc").show();
> {code}
> spark.sql.parquet.mergeSchema has been set to  "true".
> The code's exception is here 
> {code}
> ++-+---+
> |col_name|data_type|comment|
> ++-+---+
> | card_id|   string|   |
> |tag_name|   string|   |
> |   v|   double|   |
> ++-+---+
> 2016-07-13 13:40:43,637 INFO 
> [org.apache.hadoop.hive.metastore.HiveMetaStore:746] - 0: get_table : 
> db=default tbl=abc
> 2016-07-13 13:40:43,637 INFO 
> [org.apache.hadoop.hive.metastore.HiveMetaStore.audit:371] - ugi=licl
> ip=unknown-ip-addr  cmd=get_table : db=default tbl=abc  
> 2016-07-13 13:40:43,693 INFO [org.apache.spark.storage.BlockManagerInfo:58] - 
> Removed broadcast_2_piece0 on localhost:50647 in memory (size: 1176.0 B, 
> free: 1125.7 MB)
> 2016-07-13 13:40:43,700 INFO [org.apache.spark.ContextCleaner:58] - Cleaned 
> accumulator 2
> 2016-07-13 13:40:43,702 INFO [org.apache.spark.storage.BlockManagerInfo:58] - 
> Removed broadcast_1_piece0 on localhost:50647 in memory (size: 19.4 KB, free: 
> 1125.7 MB)
> 2016-07-13 13:40:43,702 INFO 
> [org.apache.hadoop.hive.metastore.HiveMetaStore:746] - 0: get_table : 
> db=default tbl=abc
> 2016-07-13 13:40:43,703 INFO 
> [org.apache.hadoop.hive.metastore.HiveMetaStore.audit:371] - ugi=licl
> ip=unknown-ip-addr  cmd=get_table : db=default tbl=abc  
> Exception in thread "main" java.lang.RuntimeException: 
> Relation[card_id#26,tag_name#27,v#28] ParquetRelation
>  requires that the query in the SELECT clause of the INSERT INTO/OVERWRITE 
> statement generates the same number of columns as its schema.
>   at scala.sys.package$.error(package.scala:27)
>   at 
> org.apache.spark.sql.execution.datasources.PreInsertCastAndRename$$anonfun$apply$2.applyOrElse(rules.scala:68)
>   at 
> org.apache.spark.sql.execution.datasources.PreInsertCastAndRename$$anonfun$apply$2.applyOrElse(rules.scala:58)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:259)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:259)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:258)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:249)
>   at 
> 

[jira] [Updated] (SPARK-16517) can't add columns on the parquet table

2016-07-13 Thread lichenglin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lichenglin updated SPARK-16517:
---
Summary: can't add columns on the parquet table   (was: can't add columns 
on the table witch column metadata is serializer)

> can't add columns on the parquet table 
> ---
>
> Key: SPARK-16517
> URL: https://issues.apache.org/jira/browse/SPARK-16517
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.2
>Reporter: lichenglin
>
> {code}
> setName("abc");
> HiveContext hive = getHiveContext();
> DataFrame d = hive.createDataFrame(
>   getJavaSparkContext().parallelize(
>   
> Arrays.asList(RowFactory.create("abc", "abc", 5.0), RowFactory.create("abcd", 
> "abcd", 5.0))),
>   DataTypes.createStructType(
>   
> Arrays.asList(DataTypes.createStructField("card_id", DataTypes.StringType, 
> true),
>   
> DataTypes.createStructField("tag_name", DataTypes.StringType, true),
>   
> DataTypes.createStructField("v", DataTypes.DoubleType, true;
> d.write().partitionBy("v").mode(SaveMode.Overwrite).saveAsTable("abc");
> hive.sql("alter table abc add columns(v2 double)");
> hive.refreshTable("abc");
> hive.sql("describe abc").show();
> DataFrame d2 = hive.createDataFrame(
>   
> getJavaSparkContext().parallelize(Arrays.asList(RowFactory.create("abc", 
> "abc", 3.0, 4.0),
>   RowFactory.create("abcd", 
> "abcd", 3.0, 1.0))),
>   new StructType(new StructField[] { 
> DataTypes.createStructField("card_id", DataTypes.StringType, true),
>   
> DataTypes.createStructField("tag_name", DataTypes.StringType, true),
>   
> DataTypes.createStructField("v", DataTypes.DoubleType, true),
>   
> DataTypes.createStructField("v2", DataTypes.DoubleType, true) }));
> d2.write().partitionBy("v").mode(SaveMode.Append).saveAsTable("abc");
> hive.table("abc").show();
> {code}
> spark.sql.parquet.mergeSchema has been set to  "true".
> The code's exception is here 
> {code}
> ++-+---+
> |col_name|data_type|comment|
> ++-+---+
> | card_id|   string|   |
> |tag_name|   string|   |
> |   v|   double|   |
> ++-+---+
> 2016-07-13 13:40:43,637 INFO 
> [org.apache.hadoop.hive.metastore.HiveMetaStore:746] - 0: get_table : 
> db=default tbl=abc
> 2016-07-13 13:40:43,637 INFO 
> [org.apache.hadoop.hive.metastore.HiveMetaStore.audit:371] - ugi=licl
> ip=unknown-ip-addr  cmd=get_table : db=default tbl=abc  
> 2016-07-13 13:40:43,693 INFO [org.apache.spark.storage.BlockManagerInfo:58] - 
> Removed broadcast_2_piece0 on localhost:50647 in memory (size: 1176.0 B, 
> free: 1125.7 MB)
> 2016-07-13 13:40:43,700 INFO [org.apache.spark.ContextCleaner:58] - Cleaned 
> accumulator 2
> 2016-07-13 13:40:43,702 INFO [org.apache.spark.storage.BlockManagerInfo:58] - 
> Removed broadcast_1_piece0 on localhost:50647 in memory (size: 19.4 KB, free: 
> 1125.7 MB)
> 2016-07-13 13:40:43,702 INFO 
> [org.apache.hadoop.hive.metastore.HiveMetaStore:746] - 0: get_table : 
> db=default tbl=abc
> 2016-07-13 13:40:43,703 INFO 
> [org.apache.hadoop.hive.metastore.HiveMetaStore.audit:371] - ugi=licl
> ip=unknown-ip-addr  cmd=get_table : db=default tbl=abc  
> Exception in thread "main" java.lang.RuntimeException: 
> Relation[card_id#26,tag_name#27,v#28] ParquetRelation
>  requires that the query in the SELECT clause of the INSERT INTO/OVERWRITE 
> statement generates the same number of columns as its schema.
>   at scala.sys.package$.error(package.scala:27)
>   at 
> org.apache.spark.sql.execution.datasources.PreInsertCastAndRename$$anonfun$apply$2.applyOrElse(rules.scala:68)
>   at 
> org.apache.spark.sql.execution.datasources.PreInsertCastAndRename$$anonfun$apply$2.applyOrElse(rules.scala:58)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:259)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:259)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:258)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:249)
>   at 
> org.apache.spark.sql.execution.datasources.PreInsertCastAndRename$.apply(rules.scala:58)

[jira] [Updated] (SPARK-16517) can't add columns on the table whose column metadata is serializer

2016-07-12 Thread lichenglin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lichenglin updated SPARK-16517:
---
Description: 
{code}
setName("abc");
HiveContext hive = getHiveContext();
DataFrame d = hive.createDataFrame(
getJavaSparkContext().parallelize(

Arrays.asList(RowFactory.create("abc", "abc", 5.0), RowFactory.create("abcd", 
"abcd", 5.0))),
DataTypes.createStructType(

Arrays.asList(DataTypes.createStructField("card_id", DataTypes.StringType, 
true),

DataTypes.createStructField("tag_name", DataTypes.StringType, true),

DataTypes.createStructField("v", DataTypes.DoubleType, true;
d.write().partitionBy("v").mode(SaveMode.Overwrite).saveAsTable("abc");
hive.sql("alter table abc add columns(v2 double)");
hive.refreshTable("abc");
hive.sql("describe abc").show();
DataFrame d2 = hive.createDataFrame(

getJavaSparkContext().parallelize(Arrays.asList(RowFactory.create("abc", "abc", 
3.0, 4.0),
RowFactory.create("abcd", 
"abcd", 3.0, 1.0))),
new StructType(new StructField[] { 
DataTypes.createStructField("card_id", DataTypes.StringType, true),

DataTypes.createStructField("tag_name", DataTypes.StringType, true),

DataTypes.createStructField("v", DataTypes.DoubleType, true),

DataTypes.createStructField("v2", DataTypes.DoubleType, true) }));
d2.write().partitionBy("v").mode(SaveMode.Append).saveAsTable("abc");
hive.table("abc").show();
{code}
spark.sql.parquet.mergeSchema has been set to  "true".

The code's exception is here 
{code}

++-+---+
|col_name|data_type|comment|
++-+---+
| card_id|   string|   |
|tag_name|   string|   |
|   v|   double|   |
++-+---+

2016-07-13 13:40:43,637 INFO 
[org.apache.hadoop.hive.metastore.HiveMetaStore:746] - 0: get_table : 
db=default tbl=abc
2016-07-13 13:40:43,637 INFO 
[org.apache.hadoop.hive.metastore.HiveMetaStore.audit:371] - ugi=licl  
ip=unknown-ip-addr  cmd=get_table : db=default tbl=abc  
2016-07-13 13:40:43,693 INFO [org.apache.spark.storage.BlockManagerInfo:58] - 
Removed broadcast_2_piece0 on localhost:50647 in memory (size: 1176.0 B, free: 
1125.7 MB)
2016-07-13 13:40:43,700 INFO [org.apache.spark.ContextCleaner:58] - Cleaned 
accumulator 2
2016-07-13 13:40:43,702 INFO [org.apache.spark.storage.BlockManagerInfo:58] - 
Removed broadcast_1_piece0 on localhost:50647 in memory (size: 19.4 KB, free: 
1125.7 MB)
2016-07-13 13:40:43,702 INFO 
[org.apache.hadoop.hive.metastore.HiveMetaStore:746] - 0: get_table : 
db=default tbl=abc
2016-07-13 13:40:43,703 INFO 
[org.apache.hadoop.hive.metastore.HiveMetaStore.audit:371] - ugi=licl  
ip=unknown-ip-addr  cmd=get_table : db=default tbl=abc  
Exception in thread "main" java.lang.RuntimeException: 
Relation[card_id#26,tag_name#27,v#28] ParquetRelation
 requires that the query in the SELECT clause of the INSERT INTO/OVERWRITE 
statement generates the same number of columns as its schema.
at scala.sys.package$.error(package.scala:27)
at 
org.apache.spark.sql.execution.datasources.PreInsertCastAndRename$$anonfun$apply$2.applyOrElse(rules.scala:68)
at 
org.apache.spark.sql.execution.datasources.PreInsertCastAndRename$$anonfun$apply$2.applyOrElse(rules.scala:58)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:259)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:259)
at 
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:258)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:249)
at 
org.apache.spark.sql.execution.datasources.PreInsertCastAndRename$.apply(rules.scala:58)
at 
org.apache.spark.sql.execution.datasources.PreInsertCastAndRename$.apply(rules.scala:57)
at 
org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:83)
at 
org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:80)
at 
scala.collection.LinearSeqOptimized$class.foldLeft(LinearSeqOptimized.scala:111)
at scala.collection.immutable.List.foldLeft(List.scala:84)
at 

[jira] [Updated] (SPARK-16517) can't add columns on the table whose column metadata is serializer

2016-07-12 Thread lichenglin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lichenglin updated SPARK-16517:
---
Description: 
{code}
setName("abc");
HiveContext hive = getHiveContext();
DataFrame d = hive.createDataFrame(
getJavaSparkContext().parallelize(

Arrays.asList(RowFactory.create("abc", "abc", 5.0), RowFactory.create("abcd", 
"abcd", 5.0))),
DataTypes.createStructType(

Arrays.asList(DataTypes.createStructField("card_id", DataTypes.StringType, 
true),

DataTypes.createStructField("tag_name", DataTypes.StringType, true),

DataTypes.createStructField("v", DataTypes.DoubleType, true;

d.write().partitionBy("v").mode(SaveMode.Overwrite).saveAsTable("abc");
hive.sql("alter table abc add columns(v2 double)");
hive.refreshTable("abc");
hive.sql("describe abc").show();
DataFrame d2 = hive.createDataFrame(

getJavaSparkContext().parallelize(Arrays.asList(RowFactory.create("abc", "abc", 
3.0, 4.0),
RowFactory.create("abcd", 
"abcd", 3.0, 1.0))),
new StructType(new StructField[] { 
DataTypes.createStructField("card_id", DataTypes.StringType, true),

DataTypes.createStructField("tag_name", DataTypes.StringType, true),

DataTypes.createStructField("v", DataTypes.DoubleType, true),

DataTypes.createStructField("v2", DataTypes.DoubleType, true) }));

d2.write().partitionBy("v").mode(SaveMode.Append).saveAsTable("abc");
hive.table("abc").show();
{code}
spark.sql.parquet.mergeSchema has been set to  "true".

The code's exception is here 
{code}

++-+---+
|col_name|data_type|comment|
++-+---+
| card_id|   string|   |
|tag_name|   string|   |
|   v|   double|   |
++-+---+

2016-07-13 13:40:43,637 INFO 
[org.apache.hadoop.hive.metastore.HiveMetaStore:746] - 0: get_table : 
db=default tbl=abc
2016-07-13 13:40:43,637 INFO 
[org.apache.hadoop.hive.metastore.HiveMetaStore.audit:371] - ugi=licl  
ip=unknown-ip-addr  cmd=get_table : db=default tbl=abc  
2016-07-13 13:40:43,693 INFO [org.apache.spark.storage.BlockManagerInfo:58] - 
Removed broadcast_2_piece0 on localhost:50647 in memory (size: 1176.0 B, free: 
1125.7 MB)
2016-07-13 13:40:43,700 INFO [org.apache.spark.ContextCleaner:58] - Cleaned 
accumulator 2
2016-07-13 13:40:43,702 INFO [org.apache.spark.storage.BlockManagerInfo:58] - 
Removed broadcast_1_piece0 on localhost:50647 in memory (size: 19.4 KB, free: 
1125.7 MB)
2016-07-13 13:40:43,702 INFO 
[org.apache.hadoop.hive.metastore.HiveMetaStore:746] - 0: get_table : 
db=default tbl=abc
2016-07-13 13:40:43,703 INFO 
[org.apache.hadoop.hive.metastore.HiveMetaStore.audit:371] - ugi=licl  
ip=unknown-ip-addr  cmd=get_table : db=default tbl=abc  
Exception in thread "main" java.lang.RuntimeException: 
Relation[card_id#26,tag_name#27,v#28] ParquetRelation
 requires that the query in the SELECT clause of the INSERT INTO/OVERWRITE 
statement generates the same number of columns as its schema.
at scala.sys.package$.error(package.scala:27)
at 
org.apache.spark.sql.execution.datasources.PreInsertCastAndRename$$anonfun$apply$2.applyOrElse(rules.scala:68)
at 
org.apache.spark.sql.execution.datasources.PreInsertCastAndRename$$anonfun$apply$2.applyOrElse(rules.scala:58)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:259)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:259)
at 
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:258)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:249)
at 
org.apache.spark.sql.execution.datasources.PreInsertCastAndRename$.apply(rules.scala:58)
at 
org.apache.spark.sql.execution.datasources.PreInsertCastAndRename$.apply(rules.scala:57)
at 
org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:83)
at 
org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:80)
at 
scala.collection.LinearSeqOptimized$class.foldLeft(LinearSeqOptimized.scala:111)
at 

[jira] [Created] (SPARK-16517) can't add columns on the table whose column metadata is serializer

2016-07-12 Thread lichenglin (JIRA)
lichenglin created SPARK-16517:
--

 Summary: can't add columns on the table whose column metadata is serializer
 Key: SPARK-16517
 URL: https://issues.apache.org/jira/browse/SPARK-16517
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.6.2
Reporter: lichenglin


{code}
setName("abc");
HiveContext hive = getHiveContext();
DataFrame d = hive.createDataFrame(
        getJavaSparkContext().parallelize(
                Arrays.asList(RowFactory.create("abc", "abc", 5.0),
                              RowFactory.create("abcd", "abcd", 5.0))),
        DataTypes.createStructType(
                Arrays.asList(
                        DataTypes.createStructField("card_id", DataTypes.StringType, true),
                        DataTypes.createStructField("tag_name", DataTypes.StringType, true),
                        DataTypes.createStructField("v", DataTypes.DoubleType, true))));
d.write().partitionBy("v").mode(SaveMode.Overwrite).saveAsTable("abc");
hive.sql("alter table abc add columns(v2 double)");
hive.refreshTable("abc");
hive.sql("describe abc").show();
DataFrame d2 = hive.createDataFrame(
        getJavaSparkContext().parallelize(
                Arrays.asList(RowFactory.create("abc", "abc", 3.0, 4.0),
                              RowFactory.create("abcd", "abcd", 3.0, 1.0))),
        new StructType(new StructField[] {
                DataTypes.createStructField("card_id", DataTypes.StringType, true),
                DataTypes.createStructField("tag_name", DataTypes.StringType, true),
                DataTypes.createStructField("v", DataTypes.DoubleType, true),
                DataTypes.createStructField("v2", DataTypes.DoubleType, true) }));
d2.write().partitionBy("v").mode(SaveMode.Append).saveAsTable("abc");
hive.table("abc").show();
{code}
spark.sql.parquet.mergeSchema has been set to  "true".

The exception thrown by the code is:
{code}

++-+---+
|col_name|data_type|comment|
++-+---+
| card_id|   string|   |
|tag_name|   string|   |
|   v|   double|   |
++-+---+

2016-07-13 13:40:43,637 INFO 
[org.apache.hadoop.hive.metastore.HiveMetaStore:746] - 0: get_table : 
db=default tbl=abc
2016-07-13 13:40:43,637 INFO 
[org.apache.hadoop.hive.metastore.HiveMetaStore.audit:371] - ugi=licl  
ip=unknown-ip-addr  cmd=get_table : db=default tbl=abc  
2016-07-13 13:40:43,693 INFO [org.apache.spark.storage.BlockManagerInfo:58] - 
Removed broadcast_2_piece0 on localhost:50647 in memory (size: 1176.0 B, free: 
1125.7 MB)
2016-07-13 13:40:43,700 INFO [org.apache.spark.ContextCleaner:58] - Cleaned 
accumulator 2
2016-07-13 13:40:43,702 INFO [org.apache.spark.storage.BlockManagerInfo:58] - 
Removed broadcast_1_piece0 on localhost:50647 in memory (size: 19.4 KB, free: 
1125.7 MB)
2016-07-13 13:40:43,702 INFO 
[org.apache.hadoop.hive.metastore.HiveMetaStore:746] - 0: get_table : 
db=default tbl=abc
2016-07-13 13:40:43,703 INFO 
[org.apache.hadoop.hive.metastore.HiveMetaStore.audit:371] - ugi=licl  
ip=unknown-ip-addr  cmd=get_table : db=default tbl=abc  
Exception in thread "main" java.lang.RuntimeException: 
Relation[card_id#26,tag_name#27,v#28] ParquetRelation
 requires that the query in the SELECT clause of the INSERT INTO/OVERWRITE 
statement generates the same number of columns as its schema.
at scala.sys.package$.error(package.scala:27)
at 
org.apache.spark.sql.execution.datasources.PreInsertCastAndRename$$anonfun$apply$2.applyOrElse(rules.scala:68)
at 
org.apache.spark.sql.execution.datasources.PreInsertCastAndRename$$anonfun$apply$2.applyOrElse(rules.scala:58)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:259)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:259)
at 
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:258)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:249)
at 
org.apache.spark.sql.execution.datasources.PreInsertCastAndRename$.apply(rules.scala:58)
at 
org.apache.spark.sql.execution.datasources.PreInsertCastAndRename$.apply(rules.scala:57)
at 
org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:83)
at 
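{code}

For reference, a workaround that avoids the ALTER TABLE path altogether is to rebuild the table with the widened schema instead of appending to it. The sketch below is only a minimal illustration against the code above, assuming the Spark 1.6 Java API; the target table name "abc_widened" is hypothetical, because a table cannot be overwritten while it is also being read from.

{code}
// Pad the existing rows with a null "v2" column, union with the new rows,
// and write everything out under the widened schema.
import static org.apache.spark.sql.functions.lit;

DataFrame old = hive.table("abc")
        .withColumn("v2", lit(null).cast(DataTypes.DoubleType));
DataFrame merged = old.unionAll(d2.select("card_id", "tag_name", "v", "v2"));
merged.write()
        .partitionBy("v")
        .mode(SaveMode.Overwrite)
        .saveAsTable("abc_widened");   // hypothetical name, not the original "abc"
{code}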

[jira] [Commented] (SPARK-16361) It takes a long time for gc when building cube with many fields

2016-07-05 Thread lichenglin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16361?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15362180#comment-15362180
 ] 

lichenglin commented on SPARK-16361:


I think a cube with about 10 fields is common in an OLAP system.

Is there anything we can do to reduce the memory needed to build the cube?
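
For context, cube() over k grouping columns expands every input row into 2^k grouping sets, so the 9 fields above mean 512 aggregation rows per input row, which is where the memory and GC pressure comes from. A hedged sketch (not the reporter's Cuber utility, and assuming the Spark 1.6 Java API and the names from the code quoted below) is to pre-aggregate to one row per distinct dimension combination before calling cube(), so the expansion works on fewer rows:

{code}
import static org.apache.spark.sql.functions.*;

// "b" and "d" are the Bucketizer and DataFrame from the code quoted below.
DataFrame dims = b.transform(d);

// One row per distinct dimension combination, carrying partial aggregates.
DataFrame pre = dims
        .groupBy("day", "AREA_CODE", "CUST_TYPE", "age", "zwsc",
                 "month", "jidu", "year", "SUBTYPE")
        .agg(max("age").as("max_age"), min("age").as("min_age"),
             sum("zwsc").as("sum_zwsc"), count(lit(1)).as("cnt"));

// The cube then only has to expand the pre-aggregated rows.
DataFrame cubed = pre
        .cube("day", "AREA_CODE", "CUST_TYPE", "age", "zwsc",
              "month", "jidu", "year", "SUBTYPE")
        .agg(max("max_age"), min("min_age"), sum("sum_zwsc"), sum("cnt"));
{code}

Whether this helps depends on how many duplicate dimension combinations the input has; it is a sketch, not a measured fix.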

> It takes a long time for gc when building cube with  many fields
> 
>
> Key: SPARK-16361
> URL: https://issues.apache.org/jira/browse/SPARK-16361
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.6.2
>Reporter: lichenglin
>
> I'm using spark to build cube on a dataframe with 1m data.
> I found that when I add too many fields (about 8 or above) 
> the worker takes a lot of time for GC.
> I try to increase the memory of each worker but it not work well.
> but I don't know why,sorry.
> here is my simple code and monitoring 
> Cuber is a util class for building cube.
> {code:title=Bar.java|borderStyle=solid}
>   sqlContext.udf().register("jidu", (Integer f) -> {
>   return (f - 1) / 3 + 1;
>   } , DataTypes.IntegerType);
>   DataFrame d = 
> sqlContext.table("dw.dw_cust_info").selectExpr("*", "cast (CUST_AGE as 
> double) as c_age",
>   "month(day) as month", "year(day) as year", 
> "cast ((datediff(now(),INTIME)/365+1) as int ) as zwsc",
>   "jidu(month(day)) as jidu");
>   Bucketizer b = new 
> Bucketizer().setInputCol("c_age").setSplits(new double[] { 
> Double.NEGATIVE_INFINITY, 0, 10,
>   20, 30, 40, 50, 60, 70, 80, 90, 100, 
> Double.POSITIVE_INFINITY }).setOutputCol("age");
>   DataFrame cube = new Cuber(b.transform(d))
>   .addFields("day", "AREA_CODE", "CUST_TYPE", 
> "age", "zwsc", "month", "jidu", "year","SUBTYPE").max("age")
>   .min("age").sum("zwsc").count().buildcube();
>   
> cube.write().mode(SaveMode.Overwrite).saveAsTable("dt.cuberdemo");
> {code}
> Summary Metrics for 12 Completed Tasks
> Metric                        Min                25th percentile    Median             75th percentile    Max
> Duration                      2.6 min            2.7 min            2.7 min            2.7 min            2.7 min
> GC Time                       1.6 min            1.6 min            1.6 min            1.6 min            1.6 min
> Shuffle Read Size / Records   728.4 KB / 21886   736.6 KB / 22258   738.7 KB / 22387   746.6 KB / 22542   748.6 KB / 22783
> Shuffle Write Size / Records  74.3 MB / 1926282  75.8 MB / 1965860  76.2 MB / 1976004  76.4 MB / 1981516  77.9 MB / 2021142



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-16361) It takes a long time for gc when building cube with many fields

2016-07-04 Thread lichenglin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16361?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lichenglin updated SPARK-16361:
---
Comment: was deleted

(was: I have set the master URL in the Java application.

Here is a copy from the Spark master's UI:

Completed Applications

Application ID           Name     Cores  Memory per Node  Submitted Time       User  State   Duration
app-20160704175221-0090  no-name  12     40.0 GB          2016/07/04 17:52:21  root  KILLED  2.1 h
app-20160704174141-0089  no-name  12     20.0 GB          2016/07/04 17:41:41  root  KILLED  10 min

I added another two fields to the cube, and both jobs crashed.)

> It takes a long time for gc when building cube with  many fields
> 
>
> Key: SPARK-16361
> URL: https://issues.apache.org/jira/browse/SPARK-16361
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.6.2
>Reporter: lichenglin
>
> I'm using spark to build cube on a dataframe with 1m data.
> I found that when I add too many fields (about 8 or above) 
> the worker takes a lot of time for GC.
> I try to increase the memory of each worker but it not work well.
> but I don't know why,sorry.
> here is my simple code and monitoring 
> Cuber is a util class for building cube.
> {code:title=Bar.java|borderStyle=solid}
>   sqlContext.udf().register("jidu", (Integer f) -> {
>   return (f - 1) / 3 + 1;
>   } , DataTypes.IntegerType);
>   DataFrame d = 
> sqlContext.table("dw.dw_cust_info").selectExpr("*", "cast (CUST_AGE as 
> double) as c_age",
>   "month(day) as month", "year(day) as year", 
> "cast ((datediff(now(),INTIME)/365+1) as int ) as zwsc",
>   "jidu(month(day)) as jidu");
>   Bucketizer b = new 
> Bucketizer().setInputCol("c_age").setSplits(new double[] { 
> Double.NEGATIVE_INFINITY, 0, 10,
>   20, 30, 40, 50, 60, 70, 80, 90, 100, 
> Double.POSITIVE_INFINITY }).setOutputCol("age");
>   DataFrame cube = new Cuber(b.transform(d))
>   .addFields("day", "AREA_CODE", "CUST_TYPE", 
> "age", "zwsc", "month", "jidu", "year","SUBTYPE").max("age")
>   .min("age").sum("zwsc").count().buildcube();
>   
> cube.write().mode(SaveMode.Overwrite).saveAsTable("dt.cuberdemo");
> {code}
> Summary Metrics for 12 Completed Tasks
> MetricMin 25th percentile Median  75th percentile Max
> Duration  2.6 min 2.7 min 2.7 min 2.7 min 2.7 min
> GC Time   1.6 min 1.6 min 1.6 min 1.6 min 1.6 min
> Shuffle Read Size / Records   728.4 KB / 21886736.6 KB / 22258
> 738.7 KB / 22387746.6 KB / 22542748.6 KB / 22783
> Shuffle Write Size / Records  74.3 MB / 1926282   75.8 MB / 1965860   
> 76.2 MB / 1976004   76.4 MB / 1981516   77.9 MB / 2021142



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16361) It takes a long time for gc when building cube with many fields

2016-07-04 Thread lichenglin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16361?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15361869#comment-15361869
 ] 

lichenglin commented on SPARK-16361:


I have set the master URL in the Java application.

Here is a copy from the Spark master's UI:

Completed Applications

Application ID           Name     Cores  Memory per Node  Submitted Time       User  State   Duration
app-20160704175221-0090  no-name  12     40.0 GB          2016/07/04 17:52:21  root  KILLED  2.1 h
app-20160704174141-0089  no-name  12     20.0 GB          2016/07/04 17:41:41  root  KILLED  10 min

I added another two fields to the cube, and both jobs crashed.

> It takes a long time for gc when building cube with  many fields
> 
>
> Key: SPARK-16361
> URL: https://issues.apache.org/jira/browse/SPARK-16361
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.6.2
>Reporter: lichenglin
>
> I'm using spark to build cube on a dataframe with 1m data.
> I found that when I add too many fields (about 8 or above) 
> the worker takes a lot of time for GC.
> I try to increase the memory of each worker but it not work well.
> but I don't know why,sorry.
> here is my simple code and monitoring 
> Cuber is a util class for building cube.
> {code:title=Bar.java|borderStyle=solid}
>   sqlContext.udf().register("jidu", (Integer f) -> {
>   return (f - 1) / 3 + 1;
>   } , DataTypes.IntegerType);
>   DataFrame d = 
> sqlContext.table("dw.dw_cust_info").selectExpr("*", "cast (CUST_AGE as 
> double) as c_age",
>   "month(day) as month", "year(day) as year", 
> "cast ((datediff(now(),INTIME)/365+1) as int ) as zwsc",
>   "jidu(month(day)) as jidu");
>   Bucketizer b = new 
> Bucketizer().setInputCol("c_age").setSplits(new double[] { 
> Double.NEGATIVE_INFINITY, 0, 10,
>   20, 30, 40, 50, 60, 70, 80, 90, 100, 
> Double.POSITIVE_INFINITY }).setOutputCol("age");
>   DataFrame cube = new Cuber(b.transform(d))
>   .addFields("day", "AREA_CODE", "CUST_TYPE", 
> "age", "zwsc", "month", "jidu", "year","SUBTYPE").max("age")
>   .min("age").sum("zwsc").count().buildcube();
>   
> cube.write().mode(SaveMode.Overwrite).saveAsTable("dt.cuberdemo");
> {code}
> Summary Metrics for 12 Completed Tasks
> MetricMin 25th percentile Median  75th percentile Max
> Duration  2.6 min 2.7 min 2.7 min 2.7 min 2.7 min
> GC Time   1.6 min 1.6 min 1.6 min 1.6 min 1.6 min
> Shuffle Read Size / Records   728.4 KB / 21886736.6 KB / 22258
> 738.7 KB / 22387746.6 KB / 22542748.6 KB / 22783
> Shuffle Write Size / Records  74.3 MB / 1926282   75.8 MB / 1965860   
> 76.2 MB / 1976004   76.4 MB / 1981516   77.9 MB / 2021142



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16361) It takes a long time for gc when building cube with many fields

2016-07-04 Thread lichenglin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16361?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15361870#comment-15361870
 ] 

lichenglin commented on SPARK-16361:


I have set the master URL in the Java application.

Here is a copy from the Spark master's UI:

Completed Applications

Application ID           Name     Cores  Memory per Node  Submitted Time       User  State   Duration
app-20160704175221-0090  no-name  12     40.0 GB          2016/07/04 17:52:21  root  KILLED  2.1 h
app-20160704174141-0089  no-name  12     20.0 GB          2016/07/04 17:41:41  root  KILLED  10 min

I added another two fields to the cube, and both jobs crashed.

> It takes a long time for gc when building cube with  many fields
> 
>
> Key: SPARK-16361
> URL: https://issues.apache.org/jira/browse/SPARK-16361
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.6.2
>Reporter: lichenglin
>
> I'm using spark to build cube on a dataframe with 1m data.
> I found that when I add too many fields (about 8 or above) 
> the worker takes a lot of time for GC.
> I try to increase the memory of each worker but it not work well.
> but I don't know why,sorry.
> here is my simple code and monitoring 
> Cuber is a util class for building cube.
> {code:title=Bar.java|borderStyle=solid}
>   sqlContext.udf().register("jidu", (Integer f) -> {
>   return (f - 1) / 3 + 1;
>   } , DataTypes.IntegerType);
>   DataFrame d = 
> sqlContext.table("dw.dw_cust_info").selectExpr("*", "cast (CUST_AGE as 
> double) as c_age",
>   "month(day) as month", "year(day) as year", 
> "cast ((datediff(now(),INTIME)/365+1) as int ) as zwsc",
>   "jidu(month(day)) as jidu");
>   Bucketizer b = new 
> Bucketizer().setInputCol("c_age").setSplits(new double[] { 
> Double.NEGATIVE_INFINITY, 0, 10,
>   20, 30, 40, 50, 60, 70, 80, 90, 100, 
> Double.POSITIVE_INFINITY }).setOutputCol("age");
>   DataFrame cube = new Cuber(b.transform(d))
>   .addFields("day", "AREA_CODE", "CUST_TYPE", 
> "age", "zwsc", "month", "jidu", "year","SUBTYPE").max("age")
>   .min("age").sum("zwsc").count().buildcube();
>   
> cube.write().mode(SaveMode.Overwrite).saveAsTable("dt.cuberdemo");
> {code}
> Summary Metrics for 12 Completed Tasks
> MetricMin 25th percentile Median  75th percentile Max
> Duration  2.6 min 2.7 min 2.7 min 2.7 min 2.7 min
> GC Time   1.6 min 1.6 min 1.6 min 1.6 min 1.6 min
> Shuffle Read Size / Records   728.4 KB / 21886736.6 KB / 22258
> 738.7 KB / 22387746.6 KB / 22542748.6 KB / 22783
> Shuffle Write Size / Records  74.3 MB / 1926282   75.8 MB / 1965860   
> 76.2 MB / 1976004   76.4 MB / 1981516   77.9 MB / 2021142



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16361) It takes a long time for gc when building cube with many fields

2016-07-04 Thread lichenglin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16361?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15361117#comment-15361117
 ] 

lichenglin commented on SPARK-16361:


GenerateUnsafeProjection: Code generated in 4.012162 ms

The worker prints a lot of log lines like this.

Is it important?

> It takes a long time for gc when building cube with  many fields
> 
>
> Key: SPARK-16361
> URL: https://issues.apache.org/jira/browse/SPARK-16361
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.6.2
>Reporter: lichenglin
>
> I'm using spark to build cube on a dataframe with 1m data.
> I found that when I add too many fields (about 8 or above) 
> the worker takes a lot of time for GC.
> I try to increase the memory of each worker but it not work well.
> but I don't know why,sorry.
> here is my simple code and monitoring 
> Cuber is a util class for building cube.
> {code:title=Bar.java|borderStyle=solid}
>   sqlContext.udf().register("jidu", (Integer f) -> {
>   return (f - 1) / 3 + 1;
>   } , DataTypes.IntegerType);
>   DataFrame d = 
> sqlContext.table("dw.dw_cust_info").selectExpr("*", "cast (CUST_AGE as 
> double) as c_age",
>   "month(day) as month", "year(day) as year", 
> "cast ((datediff(now(),INTIME)/365+1) as int ) as zwsc",
>   "jidu(month(day)) as jidu");
>   Bucketizer b = new 
> Bucketizer().setInputCol("c_age").setSplits(new double[] { 
> Double.NEGATIVE_INFINITY, 0, 10,
>   20, 30, 40, 50, 60, 70, 80, 90, 100, 
> Double.POSITIVE_INFINITY }).setOutputCol("age");
>   DataFrame cube = new Cuber(b.transform(d))
>   .addFields("day", "AREA_CODE", "CUST_TYPE", 
> "age", "zwsc", "month", "jidu", "year","SUBTYPE").max("age")
>   .min("age").sum("zwsc").count().buildcube();
>   
> cube.write().mode(SaveMode.Overwrite).saveAsTable("dt.cuberdemo");
> {code}
> Summary Metrics for 12 Completed Tasks
> MetricMin 25th percentile Median  75th percentile Max
> Duration  2.6 min 2.7 min 2.7 min 2.7 min 2.7 min
> GC Time   1.6 min 1.6 min 1.6 min 1.6 min 1.6 min
> Shuffle Read Size / Records   728.4 KB / 21886736.6 KB / 22258
> 738.7 KB / 22387746.6 KB / 22542748.6 KB / 22783
> Shuffle Write Size / Records  74.3 MB / 1926282   75.8 MB / 1965860   
> 76.2 MB / 1976004   76.4 MB / 1981516   77.9 MB / 2021142



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16361) It takes a long time for gc when building cube with many fields

2016-07-04 Thread lichenglin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16361?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15361112#comment-15361112
 ] 

lichenglin commented on SPARK-16361:


Here are my complete settings:
{code}
spark.local.dir  /home/sparktmp
#spark.executor.cores4
spark.sql.parquet.cacheMetadata  false
spark.port.maxRetries5000
spark.kryoserializer.buffer.max  1024M
spark.kryoserializer.buffer  5M
spark.master spark://agent170:7077
spark.eventLog.enabled   true
spark.eventLog.dir   hdfs://agent170:9000/sparklog
spark.serializer org.apache.spark.serializer.KryoSerializer
spark.executor.memory4g
spark.driver.memory  2g
#spark.executor.extraJavaOptions=-XX:ReservedCodeCacheSize=256m -verbose:gc 
-XX:+PrintGCDetails -XX:+PrintGCTimeStamps
spark.executor.extraClassPath=/home/hadoop/spark-1.6.1-bin-hadoop2.6/extlib/mysql-connector-java-5.1.34.jar:/home/hadoop/spark-1.6.1-bin-hadoop2.6/extlib/oracle-driver.jar:/home/hadoop/spark-1.6.1-bin-hadoop2.6/extlib/phoenix-4.6.0-HBase-1.1-client-whithoutlib-thriserver-fastxml.jar:/home/hadoop/spark-1.6.1-bin-hadoop2.6/extlib/phoenix-spark-4.6.0-HBase-1.1.jar:/home/hadoop/spark-1.6.1-bin-hadoop2.6/extlib/spark-csv_2.10-1.3.0.jar
spark.driver.extraClassPath=/home/hadoop/spark-1.6.1-bin-hadoop2.6/extlib/mysql-connector-java-5.1.34.jar:/home/hadoop/spark-1.6.1-bin-hadoop2.6/extlib/oracle-driver.jar:/home/hadoop/spark-1.6.1-bin-hadoop2.6/extlib/phoenix-4.6.0-HBase-1.1-client-whithoutlib-thriserver-fastxml.jar:/home/hadoop/spark-1.6.1-bin-hadoop2.6/extlib/phoenix-spark-4.6.0-HBase-1.1.jar:/home/hadoop/spark-1.6.1-bin-hadoop2.6/extlib/spark-csv_2.10-1.3.0.jar
{code}
{code}
export SPARK_WORKER_MEMORY=50g
export SPARK_MASTER_OPTS=-Xmx4096m
export HADOOP_CONF_DIR=/home/hadoop/hadoop-2.6.0
export HADOOP_HOME=/home/hadoop/hadoop-2.6.0
export 
SPARK_HISTORY_OPTS=-Dspark.history.fs.logDirectory=hdfs://agent170:9000/sparklog
{code}
here is my submit command:
/home/hadoop/spark-1.6.1-bin-hadoop2.6/bin/spark-submit --executor-memory 40g \
  --executor-cores 12 --class com.bjdv.spark.job.cube.CubeDemo \
  /home/hadoop/lib/licl/sparkjob.jar 2016-07-01
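
As a hedged diagnostic step (my suggestion, not something from the report): the spark.executor.extraJavaOptions line in the settings above is commented out, so no GC detail is being logged. Re-enabling it, either in spark-defaults.conf or programmatically as sketched below, would show whether the time is going into young-generation or full collections:

{code}
import org.apache.spark.SparkConf;

// Mirrors the commented-out line in spark-defaults.conf above.
SparkConf conf = new SparkConf()
        .set("spark.executor.extraJavaOptions",
             "-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps");
{code}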



> It takes a long time for gc when building cube with  many fields
> 
>
> Key: SPARK-16361
> URL: https://issues.apache.org/jira/browse/SPARK-16361
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.6.2
>Reporter: lichenglin
>
> I'm using spark to build cube on a dataframe with 1m data.
> I found that when I add too many fields (about 8 or above) 
> the worker takes a lot of time for GC.
> I try to increase the memory of each worker but it not work well.
> but I don't know why,sorry.
> here is my simple code and monitoring 
> Cuber is a util class for building cube.
> {code:title=Bar.java|borderStyle=solid}
>   sqlContext.udf().register("jidu", (Integer f) -> {
>   return (f - 1) / 3 + 1;
>   } , DataTypes.IntegerType);
>   DataFrame d = 
> sqlContext.table("dw.dw_cust_info").selectExpr("*", "cast (CUST_AGE as 
> double) as c_age",
>   "month(day) as month", "year(day) as year", 
> "cast ((datediff(now(),INTIME)/365+1) as int ) as zwsc",
>   "jidu(month(day)) as jidu");
>   Bucketizer b = new 
> Bucketizer().setInputCol("c_age").setSplits(new double[] { 
> Double.NEGATIVE_INFINITY, 0, 10,
>   20, 30, 40, 50, 60, 70, 80, 90, 100, 
> Double.POSITIVE_INFINITY }).setOutputCol("age");
>   DataFrame cube = new Cuber(b.transform(d))
>   .addFields("day", "AREA_CODE", "CUST_TYPE", 
> "age", "zwsc", "month", "jidu", "year","SUBTYPE").max("age")
>   .min("age").sum("zwsc").count().buildcube();
>   
> cube.write().mode(SaveMode.Overwrite).saveAsTable("dt.cuberdemo");
> {code}
> Summary Metrics for 12 Completed Tasks
> MetricMin 25th percentile Median  75th percentile Max
> Duration  2.6 min 2.7 min 2.7 min 2.7 min 2.7 min
> GC Time   1.6 min 1.6 min 1.6 min 1.6 min 1.6 min
> Shuffle Read Size / Records   728.4 KB / 21886736.6 KB / 22258
> 738.7 KB / 22387746.6 KB / 22542748.6 KB / 22783
> Shuffle Write Size / Records  74.3 MB / 1926282   75.8 MB / 1965860   
> 76.2 MB / 1976004   76.4 MB / 1981516   77.9 MB / 2021142



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16361) It takes a long time for gc when building cube with many fields

2016-07-04 Thread lichenglin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16361?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15361073#comment-15361073
 ] 

lichenglin commented on SPARK-16361:


The data size is 1 million rows.

I'm sure the job with 40 GB of memory runs faster than the one with 20 GB.

But needing more than 40 GB of memory just to keep the GC time down while building
a cube over 1 million rows is really not cool.
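
For a rough sense of scale: with the 9 grouping fields in the report, cube() expands each of the 1 million input rows into 2^9 = 512 grouping-set rows, i.e. roughly 512 million rows feeding the hash aggregation, which goes a long way toward explaining the GC pressure. This is back-of-the-envelope arithmetic from the numbers quoted below, not a measurement.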

> It takes a long time for gc when building cube with  many fields
> 
>
> Key: SPARK-16361
> URL: https://issues.apache.org/jira/browse/SPARK-16361
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.6.2
>Reporter: lichenglin
>
> I'm using spark to build cube on a dataframe with 1m data.
> I found that when I add too many fields (about 8 or above) 
> the worker takes a lot of time for GC.
> I try to increase the memory of each worker but it not work well.
> but I don't know why,sorry.
> here is my simple code and monitoring 
> Cuber is a util class for building cube.
> {code:title=Bar.java|borderStyle=solid}
>   sqlContext.udf().register("jidu", (Integer f) -> {
>   return (f - 1) / 3 + 1;
>   } , DataTypes.IntegerType);
>   DataFrame d = 
> sqlContext.table("dw.dw_cust_info").selectExpr("*", "cast (CUST_AGE as 
> double) as c_age",
>   "month(day) as month", "year(day) as year", 
> "cast ((datediff(now(),INTIME)/365+1) as int ) as zwsc",
>   "jidu(month(day)) as jidu");
>   Bucketizer b = new 
> Bucketizer().setInputCol("c_age").setSplits(new double[] { 
> Double.NEGATIVE_INFINITY, 0, 10,
>   20, 30, 40, 50, 60, 70, 80, 90, 100, 
> Double.POSITIVE_INFINITY }).setOutputCol("age");
>   DataFrame cube = new Cuber(b.transform(d))
>   .addFields("day", "AREA_CODE", "CUST_TYPE", 
> "age", "zwsc", "month", "jidu", "year","SUBTYPE").max("age")
>   .min("age").sum("zwsc").count().buildcube();
>   
> cube.write().mode(SaveMode.Overwrite).saveAsTable("dt.cuberdemo");
> {code}
> Summary Metrics for 12 Completed Tasks
> MetricMin 25th percentile Median  75th percentile Max
> Duration  2.6 min 2.7 min 2.7 min 2.7 min 2.7 min
> GC Time   1.6 min 1.6 min 1.6 min 1.6 min 1.6 min
> Shuffle Read Size / Records   728.4 KB / 21886736.6 KB / 22258
> 738.7 KB / 22387746.6 KB / 22542748.6 KB / 22783
> Shuffle Write Size / Records  74.3 MB / 1926282   75.8 MB / 1965860   
> 76.2 MB / 1976004   76.4 MB / 1981516   77.9 MB / 2021142



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-16361) It takes a long time for gc when building cube with many fields

2016-07-04 Thread lichenglin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16361?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15361039#comment-15361039
 ] 

lichenglin edited comment on SPARK-16361 at 7/4/16 9:03 AM:


"A long time" means the gctime/Duration of each task.

you can find it in the monitoring server in some stage.

Every fields I add, this percent increase too ,until stuck the whole job.

My data's size is about 1 million, 1 node with 16 cores and 64 GB memory.

I have increase the memory of executor from  20 GB to 40 GB but not work well.




was (Author: licl):
"A long time" means the gctime/Duration of each task.

you can find it in the monitoring server.

Every fields I add, this percent increase too ,until stuck the whole job.

My data's size is about 1 million, 1 node with 16 cores and 64 GB memory.

I have increase the memory of executor from  20 GB to 40 GB but not work well.



> It takes a long time for gc when building cube with  many fields
> 
>
> Key: SPARK-16361
> URL: https://issues.apache.org/jira/browse/SPARK-16361
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.6.2
>Reporter: lichenglin
>
> I'm using spark to build cube on a dataframe with 1m data.
> I found that when I add too many fields (about 8 or above) 
> the worker takes a lot of time for GC.
> I try to increase the memory of each worker but it not work well.
> but I don't know why,sorry.
> here is my simple code and monitoring 
> Cuber is a util class for building cube.
> {code:title=Bar.java|borderStyle=solid}
>   sqlContext.udf().register("jidu", (Integer f) -> {
>   return (f - 1) / 3 + 1;
>   } , DataTypes.IntegerType);
>   DataFrame d = 
> sqlContext.table("dw.dw_cust_info").selectExpr("*", "cast (CUST_AGE as 
> double) as c_age",
>   "month(day) as month", "year(day) as year", 
> "cast ((datediff(now(),INTIME)/365+1) as int ) as zwsc",
>   "jidu(month(day)) as jidu");
>   Bucketizer b = new 
> Bucketizer().setInputCol("c_age").setSplits(new double[] { 
> Double.NEGATIVE_INFINITY, 0, 10,
>   20, 30, 40, 50, 60, 70, 80, 90, 100, 
> Double.POSITIVE_INFINITY }).setOutputCol("age");
>   DataFrame cube = new Cuber(b.transform(d))
>   .addFields("day", "AREA_CODE", "CUST_TYPE", 
> "age", "zwsc", "month", "jidu", "year","SUBTYPE").max("age")
>   .min("age").sum("zwsc").count().buildcube();
>   
> cube.write().mode(SaveMode.Overwrite).saveAsTable("dt.cuberdemo");
> {code}
> Summary Metrics for 12 Completed Tasks
> MetricMin 25th percentile Median  75th percentile Max
> Duration  2.6 min 2.7 min 2.7 min 2.7 min 2.7 min
> GC Time   1.6 min 1.6 min 1.6 min 1.6 min 1.6 min
> Shuffle Read Size / Records   728.4 KB / 21886736.6 KB / 22258
> 738.7 KB / 22387746.6 KB / 22542748.6 KB / 22783
> Shuffle Write Size / Records  74.3 MB / 1926282   75.8 MB / 1965860   
> 76.2 MB / 1976004   76.4 MB / 1981516   77.9 MB / 2021142



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16361) It takes a long time for gc when building cube with many fields

2016-07-04 Thread lichenglin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16361?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15361039#comment-15361039
 ] 

lichenglin commented on SPARK-16361:


"A long time" means the gctime/Duration of each task.

you can find it in the monitoring server.

Every fields I add, this percent increase too ,until stuck the whole job.

My data's size is about 1 million, 1 node with 16 cores and 64 GB memory.

I have increase the memory of executor from  20 GB to 40 GB but not work well.



> It takes a long time for gc when building cube with  many fields
> 
>
> Key: SPARK-16361
> URL: https://issues.apache.org/jira/browse/SPARK-16361
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.6.2
>Reporter: lichenglin
>
> I'm using spark to build cube on a dataframe with 1m data.
> I found that when I add too many fields (about 8 or above) 
> the worker takes a lot of time for GC.
> I try to increase the memory of each worker but it not work well.
> but I don't know why,sorry.
> here is my simple code and monitoring 
> Cuber is a util class for building cube.
> {code:title=Bar.java|borderStyle=solid}
>   sqlContext.udf().register("jidu", (Integer f) -> {
>   return (f - 1) / 3 + 1;
>   } , DataTypes.IntegerType);
>   DataFrame d = 
> sqlContext.table("dw.dw_cust_info").selectExpr("*", "cast (CUST_AGE as 
> double) as c_age",
>   "month(day) as month", "year(day) as year", 
> "cast ((datediff(now(),INTIME)/365+1) as int ) as zwsc",
>   "jidu(month(day)) as jidu");
>   Bucketizer b = new 
> Bucketizer().setInputCol("c_age").setSplits(new double[] { 
> Double.NEGATIVE_INFINITY, 0, 10,
>   20, 30, 40, 50, 60, 70, 80, 90, 100, 
> Double.POSITIVE_INFINITY }).setOutputCol("age");
>   DataFrame cube = new Cuber(b.transform(d))
>   .addFields("day", "AREA_CODE", "CUST_TYPE", 
> "age", "zwsc", "month", "jidu", "year","SUBTYPE").max("age")
>   .min("age").sum("zwsc").count().buildcube();
>   
> cube.write().mode(SaveMode.Overwrite).saveAsTable("dt.cuberdemo");
> {code}
> Summary Metrics for 12 Completed Tasks
> MetricMin 25th percentile Median  75th percentile Max
> Duration  2.6 min 2.7 min 2.7 min 2.7 min 2.7 min
> GC Time   1.6 min 1.6 min 1.6 min 1.6 min 1.6 min
> Shuffle Read Size / Records   728.4 KB / 21886736.6 KB / 22258
> 738.7 KB / 22387746.6 KB / 22542748.6 KB / 22783
> Shuffle Write Size / Records  74.3 MB / 1926282   75.8 MB / 1965860   
> 76.2 MB / 1976004   76.4 MB / 1981516   77.9 MB / 2021142



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-16361) It takes a long time for gc when building cube with many fields

2016-07-04 Thread lichenglin (JIRA)
lichenglin created SPARK-16361:
--

 Summary: It takes a long time for gc when building cube with  many 
fields
 Key: SPARK-16361
 URL: https://issues.apache.org/jira/browse/SPARK-16361
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.6.2
Reporter: lichenglin


I'm using Spark to build a cube on a DataFrame with 1 million rows.
I found that when I add too many fields (about 8 or more),
the workers spend a lot of time in GC.
I tried to increase the memory of each worker, but it did not help,
and I don't know why, sorry.
Here are my code and the monitoring numbers.
Cuber is a utility class for building the cube.

{code:title=Bar.java|borderStyle=solid}
sqlContext.udf().register("jidu", (Integer f) -> {
    return (f - 1) / 3 + 1;
}, DataTypes.IntegerType);
DataFrame d = sqlContext.table("dw.dw_cust_info").selectExpr("*",
        "cast (CUST_AGE as double) as c_age",
        "month(day) as month", "year(day) as year",
        "cast ((datediff(now(),INTIME)/365+1) as int ) as zwsc",
        "jidu(month(day)) as jidu");
Bucketizer b = new Bucketizer().setInputCol("c_age")
        .setSplits(new double[] { Double.NEGATIVE_INFINITY, 0, 10, 20, 30, 40,
                50, 60, 70, 80, 90, 100, Double.POSITIVE_INFINITY })
        .setOutputCol("age");
DataFrame cube = new Cuber(b.transform(d))
        .addFields("day", "AREA_CODE", "CUST_TYPE", "age", "zwsc",
                "month", "jidu", "year", "SUBTYPE")
        .max("age").min("age").sum("zwsc").count().buildcube();
cube.write().mode(SaveMode.Overwrite).saveAsTable("dt.cuberdemo");
{code}
Summary Metrics for 12 Completed Tasks

Metric                        Min                25th percentile    Median             75th percentile    Max
Duration                      2.6 min            2.7 min            2.7 min            2.7 min            2.7 min
GC Time                       1.6 min            1.6 min            1.6 min            1.6 min            1.6 min
Shuffle Read Size / Records   728.4 KB / 21886   736.6 KB / 22258   738.7 KB / 22387   746.6 KB / 22542   748.6 KB / 22783
Shuffle Write Size / Records  74.3 MB / 1926282  75.8 MB / 1965860  76.2 MB / 1976004  76.4 MB / 1981516  77.9 MB / 2021142




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15900) please add a map param on MQTTUtils.createStream for setting MqttConnectOptions

2016-06-11 Thread lichenglin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15900?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lichenglin updated SPARK-15900:
---
Summary: please add a map param on MQTTUtils.createStream for setting 
MqttConnectOptions   (was: please add a map param on MQTTUtils.createStreamfor 
setting MqttConnectOptions )

> please add a map param on MQTTUtils.createStream for setting 
> MqttConnectOptions 
> 
>
> Key: SPARK-15900
> URL: https://issues.apache.org/jira/browse/SPARK-15900
> Project: Spark
>  Issue Type: New Feature
>  Components: Streaming
>Affects Versions: 1.6.1
>Reporter: lichenglin
>
> I notice that MQTTReceiver will create a connection with the method
> (org.eclipse.paho.client.mqttv3.MqttClient) client.connect()
> It just means client.connect(new MqttConnectOptions());
> this causes that we have to use the default MqttConnectOptions and can't set 
> other param like usename and password.
> please add a new method at MQTTUtils.createStream like 
> createStream(jssc.ssc, brokerUrl, topic, map,storageLevel)
> in order to make a none-default MqttConnectOptions.
> thanks



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15900) please add a map param on MQTTUtils.createStreamfor setting MqttConnectOptions

2016-06-11 Thread lichenglin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15900?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lichenglin updated SPARK-15900:
---
Summary: please add a map param on MQTTUtils.createStreamfor setting 
MqttConnectOptions   (was: please add a map param on MQTTUtils.create for 
setting MqttConnectOptions )

> please add a map param on MQTTUtils.createStreamfor setting 
> MqttConnectOptions 
> ---
>
> Key: SPARK-15900
> URL: https://issues.apache.org/jira/browse/SPARK-15900
> Project: Spark
>  Issue Type: New Feature
>  Components: Streaming
>Affects Versions: 1.6.1
>Reporter: lichenglin
>
> I notice that MQTTReceiver will create a connection with the method
> (org.eclipse.paho.client.mqttv3.MqttClient) client.connect()
> It just means client.connect(new MqttConnectOptions());
> this causes that we have to use the default MqttConnectOptions and can't set 
> other param like usename and password.
> please add a new method at MQTTUtils.createStream like 
> createStream(jssc.ssc, brokerUrl, topic, map,storageLevel)
> in order to make a none-default MqttConnectOptions.
> thanks



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-15900) please add a map param on MQTTUtils.create for setting MqttConnectOptions

2016-06-11 Thread lichenglin (JIRA)
lichenglin created SPARK-15900:
--

 Summary: please add a map param on MQTTUtils.create for setting 
MqttConnectOptions 
 Key: SPARK-15900
 URL: https://issues.apache.org/jira/browse/SPARK-15900
 Project: Spark
  Issue Type: New Feature
  Components: Streaming
Affects Versions: 1.6.1
Reporter: lichenglin


I notice that MQTTReceiver creates its connection with the method
(org.eclipse.paho.client.mqttv3.MqttClient) client.connect(),
which is equivalent to client.connect(new MqttConnectOptions());
This means we are stuck with the default MqttConnectOptions and can't set
other parameters such as the username and password.

Please add a new method to MQTTUtils.createStream such as
createStream(jssc.ssc, brokerUrl, topic, map, storageLevel)
so that a non-default MqttConnectOptions can be built.
Thanks.
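
Purely as an illustration of what the requested Map parameter could carry (the key names "username", "password" and "cleanSession" are assumptions, not an existing Spark API), the options could be built like this before the receiver connects:

{code}
import java.util.Map;
import org.eclipse.paho.client.mqttv3.MqttClient;
import org.eclipse.paho.client.mqttv3.MqttConnectOptions;
import org.eclipse.paho.client.mqttv3.MqttException;

public final class MqttOptionsBuilder {
    private MqttOptionsBuilder() {}

    /** Builds non-default connect options from a plain key/value map. */
    public static MqttConnectOptions fromMap(Map<String, String> params) {
        MqttConnectOptions options = new MqttConnectOptions();
        if (params.containsKey("username")) {
            options.setUserName(params.get("username"));
        }
        if (params.containsKey("password")) {
            options.setPassword(params.get("password").toCharArray());
        }
        if (params.containsKey("cleanSession")) {
            options.setCleanSession(Boolean.parseBoolean(params.get("cleanSession")));
        }
        return options;
    }

    /** Connects the way the receiver does today, but with custom options. */
    public static void connect(MqttClient client, Map<String, String> params)
            throws MqttException {
        client.connect(fromMap(params));
    }
}
{code}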



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-15497) DecisionTreeClassificationModel can't be saved within a Pipeline because it does not implement Writable

2016-05-23 Thread lichenglin (JIRA)
lichenglin created SPARK-15497:
--

 Summary: DecisionTreeClassificationModel can't be saved within a Pipeline because it does not implement Writable
 Key: SPARK-15497
 URL: https://issues.apache.org/jira/browse/SPARK-15497
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Affects Versions: 1.6.1
Reporter: lichenglin


Here is my code
{code}
SQLContext sqlContext = getSQLContext();
DataFrame data = 
sqlContext.read().format("libsvm").load("file:///E:/workspace-mars/bigdata/sparkjob/data/mllib/sample_libsvm_data.txt");
// Index labels, adding metadata to the label column.
// Fit on whole dataset to include all labels in index.
StringIndexerModel labelIndexer = new StringIndexer()
  .setInputCol("label")
  .setOutputCol("indexedLabel")
  .fit(data);
// Automatically identify categorical features, and index them.
VectorIndexerModel featureIndexer = new VectorIndexer()
  .setInputCol("features")
  .setOutputCol("indexedFeatures")
  .setMaxCategories(4) // features with > 4 distinct values are 
treated as continuous
  .fit(data);

// Split the data into training and test sets (30% held out for 
testing)
DataFrame[] splits = data.randomSplit(new double[]{0.7, 0.3});
DataFrame trainingData = splits[0];
DataFrame testData = splits[1];

// Train a DecisionTree model.
DecisionTreeClassifier dt = new DecisionTreeClassifier()
  .setLabelCol("indexedLabel")
  .setFeaturesCol("indexedFeatures");

// Convert indexed labels back to original labels.
IndexToString labelConverter = new IndexToString()
  .setInputCol("prediction")
  .setOutputCol("predictedLabel")
  .setLabels(labelIndexer.labels());

// Chain indexers and tree in a Pipeline
Pipeline pipeline = new Pipeline()
  .setStages(new PipelineStage[]{labelIndexer, featureIndexer, 
dt, labelConverter});

// Train model.  This also runs the indexers.
PipelineModel model = pipeline.fit(trainingData);
model.save("file:///e:/tmpmodel");
{code}

and here is the exception
{code}
Exception in thread "main" java.lang.UnsupportedOperationException: Pipeline 
write will fail on this Pipeline because it contains a stage which does not 
implement Writable. Non-Writable stage: dtc_7bdeae1c4fb8 of type class 
org.apache.spark.ml.classification.DecisionTreeClassificationModel
at 
org.apache.spark.ml.Pipeline$SharedReadWrite$$anonfun$validateStages$1.apply(Pipeline.scala:218)
at 
org.apache.spark.ml.Pipeline$SharedReadWrite$$anonfun$validateStages$1.apply(Pipeline.scala:215)
at 
scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
at 
org.apache.spark.ml.Pipeline$SharedReadWrite$.validateStages(Pipeline.scala:215)
at 
org.apache.spark.ml.PipelineModel$PipelineModelWriter.(Pipeline.scala:325)
at org.apache.spark.ml.PipelineModel.write(Pipeline.scala:309)
at org.apache.spark.ml.util.MLWritable$class.save(ReadWrite.scala:131)
at org.apache.spark.ml.PipelineModel.save(Pipeline.scala:280)
at com.bjdv.spark.job.Testjob.main(Testjob.java:142)
{code}

sample_libsvm_data.txt is included in the 1.6.1 release tar
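
A hedged partial workaround (an illustration, not a fix for this ticket): persist the stages that do implement MLWritable and skip the ones, such as DecisionTreeClassificationModel in 1.6.1, that do not. The per-stage path layout below is an assumption made up for the example:

{code}
import org.apache.spark.ml.Transformer;
import org.apache.spark.ml.util.MLWritable;

Transformer[] stages = model.stages();  // "model" is the PipelineModel fitted above
for (int i = 0; i < stages.length; i++) {
    Transformer stage = stages[i];
    if (stage instanceof MLWritable) {
        try {
            // Hypothetical per-stage directory layout.
            ((MLWritable) stage).save("file:///e:/tmpmodel/stage-" + i);
        } catch (java.io.IOException e) {
            e.printStackTrace();
        }
    } else {
        System.out.println("Skipping non-writable stage: " + stage.uid());
    }
}
{code}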



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15044) spark-sql will throw "input path does not exist" exception if it handles a partition which exists in hive table, but the path is removed manually

2016-05-23 Thread lichenglin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15297540#comment-15297540
 ] 

lichenglin commented on SPARK-15044:


This exception is caused by the HiveContext caching the table's metadata.
When you delete the HDFS path,
you must refresh the table with sqlContext.refreshTable(tableName).

I think it's a bug too.

The same exception also occurs when HiveServer is used to query a table whose path
has been deleted by a Java application or in some other way.
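
A minimal illustration of the workaround described above, assuming a HiveContext named sqlContext and the table from the report:

{code}
sqlContext.refreshTable("test");                          // drop the cached metadata
DataFrame rows = sqlContext.sql("select n from test where p='1'");
rows.show();
{code}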




> spark-sql will throw "input path does not exist" exception if it handles a 
> partition which exists in hive table, but the path is removed manually
> -
>
> Key: SPARK-15044
> URL: https://issues.apache.org/jira/browse/SPARK-15044
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1
>Reporter: huangyu
>
> spark-sql will throw "input path not exist" exception if it handles a 
> partition which exists in the hive table, but the path has been removed manually. The 
> situation is as follows:
> 1) Create a table "test". "create table test (n string) partitioned by (p 
> string)"
> 2) Load some data into partition(p='1')
> 3)Remove the path related to partition(p='1') of table test manually. "hadoop 
> fs -rmr /warehouse//test/p=1"
> 4)Run spark sql, spark-sql -e "select n from test where p='1';"
> Then it throws exception:
> {code}
> org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: 
> ./test/p=1
> at 
> org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:285)
> at 
> org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:228)
> at 
> org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:304)
> at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:199)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
> at scala.Option.getOrElse(Option.scala:120)
> at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
> at scala.Option.getOrElse(Option.scala:120)
> at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
> at scala.Option.getOrElse(Option.scala:120)
> at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
> at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:66)
> at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:66)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> at scala.collection.immutable.List.foreach(List.scala:318)
> at 
> scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
> at scala.collection.AbstractTraversable.map(Traversable.scala:105)
> at org.apache.spark.rdd.UnionRDD.getPartitions(UnionRDD.scala:66)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
> at scala.Option.getOrElse(Option.scala:120)
> at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
> at scala.Option.getOrElse(Option.scala:120)
> {code}
> The bug is in Spark 1.6.1; with Spark 1.4.0 it is OK.
> I think spark-sql should ignore the missing path, just like Hive does (and as it did in earlier 
> versions), rather than throw an exception.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-15478) LogisticRegressionModel coefficients() returns an empty vector

2016-05-23 Thread lichenglin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15478?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lichenglin closed SPARK-15478.
--
Resolution: Not A Problem

> LogisticRegressionModel coefficients() returns an empty vector
> --
>
> Key: SPARK-15478
> URL: https://issues.apache.org/jira/browse/SPARK-15478
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.6.1
>Reporter: lichenglin
>
> I'm not sure this is a bug.
> I'm running the sample code like this:
> {code}
>   public static void main(String[] args) {
>   SQLContext sqlContext = getSQLContext();
>   DataFrame training = sqlContext.read().format("libsvm")
>   
> .load("file:///E:/workspace-mars/bigdata/sparkjob/data/mllib/sample_libsvm_data.txt");
>   LogisticRegression lr = new 
> LogisticRegression().setMaxIter(10).setRegParam(0.3).setElasticNetParam(0.8);
>   // Fit the model
>   LogisticRegressionModel lrModel = lr.fit(training);
>   // Print the coefficients and intercept for logistic regression
>   System.out.println("Coefficients: " + lrModel.coefficients() + 
> " Intercept: " + lrModel.intercept());
>   }
> {code}
> but coefficients() always returns an empty vector.
> {code}
> Coefficients: (692,[],[]) Intercept: 2.751535313041949
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15478) LogisticRegressionModel coefficients() returns an empty vector

2016-05-23 Thread lichenglin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15296332#comment-15296332
 ] 

lichenglin commented on SPARK-15478:


Sorry, I made a mistake and used the wrong example data.
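
As an aside (a sketch, not part of the original report): the printed value is not malformed output. "(692,[],[])" is simply how a 692-dimensional sparse vector with no non-zero entries renders, which is what you get when unsuitable training data or a strong L1 penalty shrinks every coefficient to zero.

{code}
import org.apache.spark.mllib.linalg.Vectors

// An all-zero sparse vector of length 692: no indices, no values.
val allZero = Vectors.sparse(692, Array.empty[Int], Array.empty[Double])
println(allZero) // prints (692,[],[])
{code}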

> LogisticRegressionModel coefficients() returns an empty vector
> --
>
> Key: SPARK-15478
> URL: https://issues.apache.org/jira/browse/SPARK-15478
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.6.1
>Reporter: lichenglin
>
> I'm not sure this is a bug.
> I'm running the sample code like this:
> {code}
>   public static void main(String[] args) {
>   SQLContext sqlContext = getSQLContext();
>   DataFrame training = sqlContext.read().format("libsvm")
>   
> .load("file:///E:/workspace-mars/bigdata/sparkjob/data/mllib/sample_libsvm_data.txt");
>   LogisticRegression lr = new 
> LogisticRegression().setMaxIter(10).setRegParam(0.3).setElasticNetParam(0.8);
>   // Fit the model
>   LogisticRegressionModel lrModel = lr.fit(training);
>   // Print the coefficients and intercept for logistic regression
>   System.out.println("Coefficients: " + lrModel.coefficients() + 
> " Intercept: " + lrModel.intercept());
>   }
> {code}
> but coefficients() always returns an empty vector.
> {code}
> Coefficients: (692,[],[]) Intercept: 2.751535313041949
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15478) LogisticRegressionModel 's Coefficients always return a empty vector

2016-05-23 Thread lichenglin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15478?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lichenglin updated SPARK-15478:
---
Description: 
I'm not sure this is a bug.
I'm running the sample code like this:
{code}
public static void main(String[] args) {
SQLContext sqlContext = getSQLContext();
DataFrame training = sqlContext.read().format("libsvm")

.load("file:///E:/workspace-mars/bigdata/sparkjob/data/mllib/sample_libsvm_data.txt");
LogisticRegression lr = new 
LogisticRegression().setMaxIter(10).setRegParam(0.3).setElasticNetParam(0.8);
// Fit the model
LogisticRegressionModel lrModel = lr.fit(training);

// Print the coefficients and intercept for logistic regression
System.out.println("Coefficients: " + lrModel.coefficients() + 
" Intercept: " + lrModel.intercept());

}
{code}

but coefficients() always returns an empty vector.
{code}
Coefficients: (692,[],[]) Intercept: 2.751535313041949
{code}

  was:
I don't know if this is a bug.
I'm running the sample code like this:
{code}
public static void main(String[] args) {
SQLContext sqlContext = getSQLContext();
DataFrame training = sqlContext.read().format("libsvm")

.load("file:///E:/workspace-mars/bigdata/sparkjob/data/mllib/sample_libsvm_data.txt");
LogisticRegression lr = new 
LogisticRegression().setMaxIter(10).setRegParam(0.3).setElasticNetParam(0.8);
// Fit the model
LogisticRegressionModel lrModel = lr.fit(training);

// Print the coefficients and intercept for logistic regression
System.out.println("Coefficients: " + lrModel.coefficients() + 
" Intercept: " + lrModel.intercept());

}
{code}

but coefficients() always returns an empty vector.
{code}
Coefficients: (692,[],[]) Intercept: 2.751535313041949
{code}


> LogisticRegressionModel 's  Coefficients always  return a empty vector
> --
>
> Key: SPARK-15478
> URL: https://issues.apache.org/jira/browse/SPARK-15478
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.6.1
>Reporter: lichenglin
>
> I'm not sure this is a bug.
> I'm running the sample code like this:
> {code}
>   public static void main(String[] args) {
>   SQLContext sqlContext = getSQLContext();
>   DataFrame training = sqlContext.read().format("libsvm")
>   
> .load("file:///E:/workspace-mars/bigdata/sparkjob/data/mllib/sample_libsvm_data.txt");
>   LogisticRegression lr = new 
> LogisticRegression().setMaxIter(10).setRegParam(0.3).setElasticNetParam(0.8);
>   // Fit the model
>   LogisticRegressionModel lrModel = lr.fit(training);
>   // Print the coefficients and intercept for logistic regression
>   System.out.println("Coefficients: " + lrModel.coefficients() + 
> " Intercept: " + lrModel.intercept());
>   }
> {code}
> but the Coefficients always return a empty vector.
> {code}
> Coefficients: (692,[],[]) Intercept: 2.751535313041949
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15478) LogisticRegressionModel 's Coefficients always return a empty vector

2016-05-23 Thread lichenglin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15478?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lichenglin updated SPARK-15478:
---
Summary: LogisticRegressionModel 's  Coefficients always  return a empty 
vector  (was: LogisticRegressionModel 's  Coefficients always  be a empty 
vector)

> LogisticRegressionModel 's  Coefficients always  return a empty vector
> --
>
> Key: SPARK-15478
> URL: https://issues.apache.org/jira/browse/SPARK-15478
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.6.1
>Reporter: lichenglin
>
> I don't know if this is a bug.
> I'm running the sample code like this:
> {code}
>   public static void main(String[] args) {
>   SQLContext sqlContext = getSQLContext();
>   DataFrame training = sqlContext.read().format("libsvm")
>   
> .load("file:///E:/workspace-mars/bigdata/sparkjob/data/mllib/sample_libsvm_data.txt");
>   LogisticRegression lr = new 
> LogisticRegression().setMaxIter(10).setRegParam(0.3).setElasticNetParam(0.8);
>   // Fit the model
>   LogisticRegressionModel lrModel = lr.fit(training);
>   // Print the coefficients and intercept for logistic regression
>   System.out.println("Coefficients: " + lrModel.coefficients() + 
> " Intercept: " + lrModel.intercept());
>   }
> {code}
> but coefficients() always returns an empty vector.
> {code}
> Coefficients: (692,[],[]) Intercept: 2.751535313041949
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15478) LogisticRegressionModel 's Coefficients always be a empty vector

2016-05-23 Thread lichenglin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15478?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lichenglin updated SPARK-15478:
---
Description: 
I don't know if this is a bug.
I'm running the sample code like this:
{code}
public static void main(String[] args) {
SQLContext sqlContext = getSQLContext();
DataFrame training = sqlContext.read().format("libsvm")

.load("file:///E:/workspace-mars/bigdata/sparkjob/data/mllib/sample_libsvm_data.txt");
LogisticRegression lr = new 
LogisticRegression().setMaxIter(10).setRegParam(0.3).setElasticNetParam(0.8);
// Fit the model
LogisticRegressionModel lrModel = lr.fit(training);

// Print the coefficients and intercept for logistic regression
System.out.println("Coefficients: " + lrModel.coefficients() + 
" Intercept: " + lrModel.intercept());

}
{code}

but coefficients() always returns an empty vector.
{code}
Coefficients: (692,[],[]) Intercept: 2.751535313041949
{code}

  was:
I don't know if this is a bug.
I'm running the sample code like this:

public static void main(String[] args) {
SQLContext sqlContext = getSQLContext();
DataFrame training = sqlContext.read().format("libsvm")

.load("file:///E:/workspace-mars/bigdata/sparkjob/data/mllib/sample_libsvm_data.txt");
LogisticRegression lr = new 
LogisticRegression().setMaxIter(10).setRegParam(0.3).setElasticNetParam(0.8);
// Fit the model
LogisticRegressionModel lrModel = lr.fit(training);

// Print the coefficients and intercept for logistic regression
System.out.println("Coefficients: " + lrModel.coefficients() + 
" Intercept: " + lrModel.intercept());

}


but coefficients() always returns an empty vector.

Coefficients: (692,[],[]) Intercept: 2.751535313041949



> LogisticRegressionModel 's  Coefficients always  be a empty vector
> --
>
> Key: SPARK-15478
> URL: https://issues.apache.org/jira/browse/SPARK-15478
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.6.1
>Reporter: lichenglin
>
> I don't know if this is a bug.
> I'm running the sample code like this:
> {code}
>   public static void main(String[] args) {
>   SQLContext sqlContext = getSQLContext();
>   DataFrame training = sqlContext.read().format("libsvm")
>   
> .load("file:///E:/workspace-mars/bigdata/sparkjob/data/mllib/sample_libsvm_data.txt");
>   LogisticRegression lr = new 
> LogisticRegression().setMaxIter(10).setRegParam(0.3).setElasticNetParam(0.8);
>   // Fit the model
>   LogisticRegressionModel lrModel = lr.fit(training);
>   // Print the coefficients and intercept for logistic regression
>   System.out.println("Coefficients: " + lrModel.coefficients() + 
> " Intercept: " + lrModel.intercept());
>   }
> {code}
> but coefficients() always returns an empty vector.
> {code}
> Coefficients: (692,[],[]) Intercept: 2.751535313041949
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-15478) LogisticRegressionModel 's Coefficients always be a empty vector

2016-05-23 Thread lichenglin (JIRA)
lichenglin created SPARK-15478:
--

 Summary: LogisticRegressionModel 's  Coefficients always  be a 
empty vector
 Key: SPARK-15478
 URL: https://issues.apache.org/jira/browse/SPARK-15478
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Affects Versions: 1.6.1
Reporter: lichenglin


I don't know if this is a bug.
I'm running the sample code like this:

public static void main(String[] args) {
SQLContext sqlContext = getSQLContext();
DataFrame training = sqlContext.read().format("libsvm")

.load("file:///E:/workspace-mars/bigdata/sparkjob/data/mllib/sample_libsvm_data.txt");
LogisticRegression lr = new 
LogisticRegression().setMaxIter(10).setRegParam(0.3).setElasticNetParam(0.8);
// Fit the model
LogisticRegressionModel lrModel = lr.fit(training);

// Print the coefficients and intercept for logistic regression
System.out.println("Coefficients: " + lrModel.coefficients() + 
" Intercept: " + lrModel.intercept());

}


but coefficients() always returns an empty vector.

Coefficients: (692,[],[]) Intercept: 2.751535313041949




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14886) RankingMetrics.ndcgAt throw java.lang.ArrayIndexOutOfBoundsException

2016-04-25 Thread lichenglin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14886?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lichenglin updated SPARK-14886:
---
Description: 
{code} 
@Since("1.2.0")
  def ndcgAt(k: Int): Double = {
require(k > 0, "ranking position k should be positive")
predictionAndLabels.map { case (pred, lab) =>
  val labSet = lab.toSet

  if (labSet.nonEmpty) {
val labSetSize = labSet.size
val n = math.min(math.max(pred.length, labSetSize), k)
var maxDcg = 0.0
var dcg = 0.0
var i = 0
while (i < n) {
  val gain = 1.0 / math.log(i + 2)
  if (labSet.contains(pred(i))) {
dcg += gain
  }
  if (i < labSetSize) {
maxDcg += gain
  }
  i += 1
}
dcg / maxDcg
  } else {
logWarning("Empty ground truth set, check input data")
0.0
  }
}.mean()
  }
{code}

"if (labSet.contains(pred(i)))" will throw ArrayIndexOutOfBoundsException when 
pred's size less then k.
That meas the true relevant documents has less size then the param k.
just try this with sample_movielens_data.txt
for example set pred.size to 5,labSetSize to 10,k to 20,then the n is 10. 
pred[10] not exists;
precisionAt is ok just because it has 
val n = math.min(pred.length, k)
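
A minimal reproduction sketch of the scenario above (made-up prediction and label arrays rather than sample_movielens_data.txt; local mode):

{code}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.evaluation.RankingMetrics

val sc = new SparkContext(new SparkConf().setAppName("NdcgAtRepro").setMaster("local[2]"))

// pred has 5 items, the label set has 10 and k = 20, so n = min(max(5, 10), 20) = 10
// and the loop tries to read pred(5) ... pred(9), which do not exist.
val predictionAndLabels = sc.parallelize(Seq(
  (Array(1, 2, 3, 4, 5), Array(6, 7, 8, 9, 10, 11, 12, 13, 14, 15))
))
val metrics = new RankingMetrics(predictionAndLabels)
metrics.ndcgAt(20) // throws java.lang.ArrayIndexOutOfBoundsException on 1.6.1
{code}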




  was:
{code} 
@Since("1.2.0")
  def ndcgAt(k: Int): Double = {
require(k > 0, "ranking position k should be positive")
predictionAndLabels.map { case (pred, lab) =>
  val labSet = lab.toSet

  if (labSet.nonEmpty) {
val labSetSize = labSet.size
val n = math.min(math.max(pred.length, labSetSize), k)
var maxDcg = 0.0
var dcg = 0.0
var i = 0
while (i < n) {
  val gain = 1.0 / math.log(i + 2)
  if (labSet.contains(pred(i))) {
dcg += gain
  }
  if (i < labSetSize) {
maxDcg += gain
  }
  i += 1
}
dcg / maxDcg
  } else {
logWarning("Empty ground truth set, check input data")
0.0
  }
}.mean()
  }
{code}

"if (labSet.contains(pred(i)))" will throw ArrayIndexOutOfBoundsException when 
pred's size less then k.
That meas the true relevant documents has less size then the param k.
just try this with sample_movielens_data.txt
precisionAt is ok just because it has 
val n = math.min(pred.length, k)





> RankingMetrics.ndcgAt  throw  java.lang.ArrayIndexOutOfBoundsException
> --
>
> Key: SPARK-14886
> URL: https://issues.apache.org/jira/browse/SPARK-14886
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.6.1
>Reporter: lichenglin
>Priority: Minor
>
> {code} 
> @Since("1.2.0")
>   def ndcgAt(k: Int): Double = {
> require(k > 0, "ranking position k should be positive")
> predictionAndLabels.map { case (pred, lab) =>
>   val labSet = lab.toSet
>   if (labSet.nonEmpty) {
> val labSetSize = labSet.size
> val n = math.min(math.max(pred.length, labSetSize), k)
> var maxDcg = 0.0
> var dcg = 0.0
> var i = 0
> while (i < n) {
>   val gain = 1.0 / math.log(i + 2)
>   if (labSet.contains(pred(i))) {
> dcg += gain
>   }
>   if (i < labSetSize) {
> maxDcg += gain
>   }
>   i += 1
> }
> dcg / maxDcg
>   } else {
> logWarning("Empty ground truth set, check input data")
> 0.0
>   }
> }.mean()
>   }
> {code}
> "if (labSet.contains(pred(i)))" will throw ArrayIndexOutOfBoundsException 
> when pred's size less then k.
> That meas the true relevant documents has less size then the param k.
> just try this with sample_movielens_data.txt
> for example set pred.size to 5,labSetSize to 10,k to 20,then the n is 10. 
> pred[10] not exists;
> precisionAt is ok just because it has 
> val n = math.min(pred.length, k)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14886) RankingMetrics.ndcgAt throw java.lang.ArrayIndexOutOfBoundsException

2016-04-24 Thread lichenglin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14886?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lichenglin updated SPARK-14886:
---
Description: 
 @Since("1.2.0")
  def ndcgAt(k: Int): Double = {
require(k > 0, "ranking position k should be positive")
predictionAndLabels.map { case (pred, lab) =>
  val labSet = lab.toSet

  if (labSet.nonEmpty) {
val labSetSize = labSet.size
val n = math.min(math.max(pred.length, labSetSize), k)
var maxDcg = 0.0
var dcg = 0.0
var i = 0
while (i < n) {
  val gain = 1.0 / math.log(i + 2)
  if (labSet.contains(pred(i))) {
dcg += gain
  }
  if (i < labSetSize) {
maxDcg += gain
  }
  i += 1
}
dcg / maxDcg
  } else {
logWarning("Empty ground truth set, check input data")
0.0
  }
}.mean()
  }

"if (labSet.contains(pred(i)))" will throw ArrayIndexOutOfBoundsException when 
pred's size less then k.
That meas the true relevant documents has less size then the param k.
just try this with sample_movielens_data.txt
precisionAt is ok just because it has 
val n = math.min(pred.length, k)




  was:
 @Since("1.2.0")
  def ndcgAt(k: Int): Double = {
require(k > 0, "ranking position k should be positive")
predictionAndLabels.map { case (pred, lab) =>
  val labSet = lab.toSet

  if (labSet.nonEmpty) {
val labSetSize = labSet.size
val n = math.min(math.max(pred.length, labSetSize), k)
var maxDcg = 0.0
var dcg = 0.0
var i = 0
while (i < n) {
  val gain = 1.0 / math.log(i + 2)
  if (labSet.contains(pred(i))) {
dcg += gain
  }
  if (i < labSetSize) {
maxDcg += gain
  }
  i += 1
}
dcg / maxDcg
  } else {
logWarning("Empty ground truth set, check input data")
0.0
  }
}.mean()
  }

if (labSet.contains(pred(i))) will throw an ArrayIndexOutOfBoundsException when 
the prediction list is shorter than the set of true relevant documents (and than k).
Just try this with sample_movielens_data.txt.
precisionAt is OK only because it uses 
val n = math.min(pred.length, k)





> RankingMetrics.ndcgAt  throw  java.lang.ArrayIndexOutOfBoundsException
> --
>
> Key: SPARK-14886
> URL: https://issues.apache.org/jira/browse/SPARK-14886
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.6.1
>Reporter: lichenglin
>
>  @Since("1.2.0")
>   def ndcgAt(k: Int): Double = {
> require(k > 0, "ranking position k should be positive")
> predictionAndLabels.map { case (pred, lab) =>
>   val labSet = lab.toSet
>   if (labSet.nonEmpty) {
> val labSetSize = labSet.size
> val n = math.min(math.max(pred.length, labSetSize), k)
> var maxDcg = 0.0
> var dcg = 0.0
> var i = 0
> while (i < n) {
>   val gain = 1.0 / math.log(i + 2)
>   if (labSet.contains(pred(i))) {
> dcg += gain
>   }
>   if (i < labSetSize) {
> maxDcg += gain
>   }
>   i += 1
> }
> dcg / maxDcg
>   } else {
> logWarning("Empty ground truth set, check input data")
> 0.0
>   }
> }.mean()
>   }
> "if (labSet.contains(pred(i)))" will throw ArrayIndexOutOfBoundsException 
> when pred's size less then k.
> That meas the true relevant documents has less size then the param k.
> just try this with sample_movielens_data.txt
> precisionAt is ok just because it has 
> val n = math.min(pred.length, k)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-14886) RankingMetrics.ndcgAt throw java.lang.ArrayIndexOutOfBoundsException

2016-04-24 Thread lichenglin (JIRA)
lichenglin created SPARK-14886:
--

 Summary: RankingMetrics.ndcgAt  throw  
java.lang.ArrayIndexOutOfBoundsException
 Key: SPARK-14886
 URL: https://issues.apache.org/jira/browse/SPARK-14886
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Affects Versions: 1.6.1
Reporter: lichenglin


 @Since("1.2.0")
  def ndcgAt(k: Int): Double = {
require(k > 0, "ranking position k should be positive")
predictionAndLabels.map { case (pred, lab) =>
  val labSet = lab.toSet

  if (labSet.nonEmpty) {
val labSetSize = labSet.size
val n = math.min(math.max(pred.length, labSetSize), k)
var maxDcg = 0.0
var dcg = 0.0
var i = 0
while (i < n) {
  val gain = 1.0 / math.log(i + 2)
  if (labSet.contains(pred(i))) {
dcg += gain
  }
  if (i < labSetSize) {
maxDcg += gain
  }
  i += 1
}
dcg / maxDcg
  } else {
logWarning("Empty ground truth set, check input data")
0.0
  }
}.mean()
  }

if (labSet.contains(pred(i))) will throw an ArrayIndexOutOfBoundsException when 
the prediction list is shorter than the set of true relevant documents (and than k).
Just try this with sample_movielens_data.txt.
precisionAt is OK only because it uses 
val n = math.min(pred.length, k)
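
For illustration, a self-contained sketch of the kind of guard hinted at above (not necessarily the fix that was eventually merged into Spark):

{code}
// Guard the array access so a prediction list shorter than n cannot overflow,
// mirroring what precisionAt effectively does with math.min(pred.length, k).
def ndcgAtSafe(pred: Array[Int], labels: Array[Int], k: Int): Double = {
  require(k > 0, "ranking position k should be positive")
  val labSet = labels.toSet
  if (labSet.isEmpty) return 0.0
  val labSetSize = labSet.size
  val n = math.min(math.max(pred.length, labSetSize), k)
  var maxDcg = 0.0
  var dcg = 0.0
  var i = 0
  while (i < n) {
    val gain = 1.0 / math.log(i + 2)
    if (i < pred.length && labSet.contains(pred(i))) dcg += gain
    if (i < labSetSize) maxDcg += gain
    i += 1
  }
  dcg / maxDcg
}
{code}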






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-13999) Run 'group by' before building cube

2016-03-19 Thread lichenglin (JIRA)
lichenglin created SPARK-13999:
--

 Summary: Run 'group by'  before building cube
 Key: SPARK-13999
 URL: https://issues.apache.org/jira/browse/SPARK-13999
 Project: Spark
  Issue Type: Improvement
Reporter: lichenglin


When I try to build a cube on a data set which has about 1 billion rows,
with 7 dimensions in the cube,
it takes a whole day to finish the job with 16 cores.

If I instead run 'select count(1) from table group by A,B,C,D,E,F,G' first
and then build the cube on the 'group by' result data set,
using the same dimensions and summing the pre-computed 'count',
it only needs 45 minutes.

The group by reduces the data set's row count from billions to millions,
depending on the number of dimensions.

We could try this in the new version.

Averages are more complex: both the sum and the count have to be kept during the 
group by.
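
A rough sketch of that two-step approach with the DataFrame API (sqlContext, the table name "fact_table" and the columns A..G are placeholders):

{code}
import org.apache.spark.sql.functions._

// 1) Collapse the raw table to one row per dimension combination.
val grouped = sqlContext.table("fact_table")
  .groupBy("A", "B", "C", "D", "E", "F", "G")
  .agg(count(lit(1)).as("cnt")) // billions of rows -> millions of rows

// 2) Build the cube on the much smaller result, summing the pre-computed counts.
val cubed = grouped
  .cube("A", "B", "C", "D", "E", "F", "G")
  .agg(sum("cnt").as("cnt"))

// For averages, keep both sum(x) and count in step 1 and divide after the cube,
// since averaging the pre-aggregated rows directly would be wrong.
{code}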

  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-13907) Imporvement the cube with the Fast Cubing In apache Kylin

2016-03-15 Thread lichenglin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13907?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lichenglin closed SPARK-13907.
--

> Imporvement the cube with the Fast Cubing In apache Kylin
> -
>
> Key: SPARK-13907
> URL: https://issues.apache.org/jira/browse/SPARK-13907
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.6.1
>Reporter: lichenglin
>
> I tried to build a cube on a 100-million-row data set.
> When I set 9 fields to build the cube with 10 cores,
> it cost me nearly a whole day to finish the job.
> At the same time, it generated almost 1 TB of data in the "/tmp" folder.
> Could we adopt the "fast cubing" algorithm from Apache Kylin
> to make the cube build faster?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13907) Imporvement the cube with the Fast Cubing In apache Kylin

2016-03-15 Thread lichenglin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13907?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lichenglin updated SPARK-13907:
---
Description: 
I tried to build a cube on a 100-million-row data set.
When I set 9 fields to build the cube with 10 cores,
it cost me nearly a whole day to finish the job.
At the same time, it generated almost 1 TB of data in the "/tmp" folder.
Could we adopt the "fast cubing" algorithm from Apache Kylin
to make the cube build faster?


  was:
I tried to build a cube on a 100-million-row data set.
When I set 9 fields to build the cube with 10 cores,
it cost me nearly a whole day to finish the job.
At the same time, it generated almost 1 TB of data in the "/tmp" folder.
Could we adopt the "fast cubing" algorithm from Apache Kylin
to make the cube build faster?
For example,
the group(A,B) result can be derived from the group(A,B,C) data.



> Imporvement the cube with the Fast Cubing In apache Kylin
> -
>
> Key: SPARK-13907
> URL: https://issues.apache.org/jira/browse/SPARK-13907
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.6.1
>Reporter: lichenglin
>
> I tried to build a cube on a 100-million-row data set.
> When I set 9 fields to build the cube with 10 cores,
> it cost me nearly a whole day to finish the job.
> At the same time, it generated almost 1 TB of data in the "/tmp" folder.
> Could we adopt the "fast cubing" algorithm from Apache Kylin
> to make the cube build faster?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13907) Imporvement the cube with the Fast Cubing In apache Kylin

2016-03-15 Thread lichenglin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13907?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lichenglin updated SPARK-13907:
---
Description: 
I tried to build a cube on a 100-million-row data set.
When I set 9 fields to build the cube with 10 cores,
it cost me nearly a whole day to finish the job.
At the same time, it generated almost 1 TB of data in the "/tmp" folder.
Could we adopt the "fast cubing" algorithm from Apache Kylin
to make the cube build faster?
For example,
the group(A,B) result can be derived from the group(A,B,C) data.


  was:
I tried to build a cube on a 100-million-row data set.
When I set 9 fields to build the cube with 10 cores,
it cost me nearly a whole day to finish the job.
At the same time, it generated almost 1 TB of data in the "/tmp" folder.
Could we adopt the "fast cubing" algorithm from Apache Kylin
to make the cube build faster?


> Imporvement the cube with the Fast Cubing In apache Kylin
> -
>
> Key: SPARK-13907
> URL: https://issues.apache.org/jira/browse/SPARK-13907
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.6.1
>Reporter: lichenglin
>
> I tried to build a cube on a 100-million-row data set.
> When I set 9 fields to build the cube with 10 cores,
> it cost me nearly a whole day to finish the job.
> At the same time, it generated almost 1 TB of data in the "/tmp" folder.
> Could we adopt the "fast cubing" algorithm from Apache Kylin
> to make the cube build faster?
> For example,
> the group(A,B) result can be derived from the group(A,B,C) data.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-13907) Imporvement the cube with the Fast Cubing In apache Kylin

2016-03-15 Thread lichenglin (JIRA)
lichenglin created SPARK-13907:
--

 Summary: Imporvement the cube with the Fast Cubing In apache Kylin
 Key: SPARK-13907
 URL: https://issues.apache.org/jira/browse/SPARK-13907
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.6.1
Reporter: lichenglin


I tried to build a cube on a 100-million-row data set.
When I set 9 fields to build the cube with 10 cores,
it cost me nearly a whole day to finish the job.
At the same time, it generated almost 1 TB of data in the "/tmp" folder.
Could we adopt the "fast cubing" algorithm from Apache Kylin
to make the cube build faster?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13433) The standalone server should limit the count of cores and memory for running Drivers

2016-02-22 Thread lichenglin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15157801#comment-15157801
 ] 

lichenglin commented on SPARK-13433:


I know the property 'spark.driver.cores'.

What I want to limit is the total count, not the count for a single driver.

So what should I do if the drivers have used all the cores?

There is no core left for any app, so the drivers will never stop and free their 
resources. Is that correct?

> The standalone   server should limit the count of cores and memory for 
> running Drivers
> --
>
> Key: SPARK-13433
> URL: https://issues.apache.org/jira/browse/SPARK-13433
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler
>Affects Versions: 1.6.0
>Reporter: lichenglin
>
> I have a 16-core cluster.
> A running driver uses at least 1 core, maybe more.
> When I submit a lot of jobs to the standalone server in cluster mode,
> all the cores may end up being used by running drivers,
> and then there are no cores left to run applications.
> The server is stuck.
> So I think we should limit the resources (cores and memory) used by running drivers.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13433) The standalone server should limit the count of cores and memory for running Drivers

2016-02-22 Thread lichenglin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15157802#comment-15157802
 ] 

lichenglin commented on SPARK-13433:


I know the property 'spark.driver.cores'.

What I want to limit is the total count, not the count for a single driver.

So what should I do if the drivers have used all the cores?

There is no core left for any app, so the drivers will never stop and free their 
resources. Is that correct?

> The standalone   server should limit the count of cores and memory for 
> running Drivers
> --
>
> Key: SPARK-13433
> URL: https://issues.apache.org/jira/browse/SPARK-13433
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler
>Affects Versions: 1.6.0
>Reporter: lichenglin
>
> I have a 16-core cluster.
> A running driver uses at least 1 core, maybe more.
> When I submit a lot of jobs to the standalone server in cluster mode,
> all the cores may end up being used by running drivers,
> and then there are no cores left to run applications.
> The server is stuck.
> So I think we should limit the resources (cores and memory) used by running drivers.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13433) The standalone server should limit the count of cores and memory for running Drivers

2016-02-22 Thread lichenglin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15157121#comment-15157121
 ] 

lichenglin commented on SPARK-13433:


What I mean is 
we should set a limit on the total cores used by running drivers.

For example, if the cluster has 16 cores, the limit might be 10.

If an 11th driver comes in, it should stay queued (SUBMITTED) so that the remaining 6 
cores are left to run applications.


> The standalone   server should limit the count of cores and memory for 
> running Drivers
> --
>
> Key: SPARK-13433
> URL: https://issues.apache.org/jira/browse/SPARK-13433
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler
>Affects Versions: 1.6.0
>Reporter: lichenglin
>
> I have a 16-core cluster.
> A running driver uses at least 1 core, maybe more.
> When I submit a lot of jobs to the standalone server in cluster mode,
> all the cores may end up being used by running drivers,
> and then there are no cores left to run applications.
> The server is stuck.
> So I think we should limit the resources (cores and memory) used by running drivers.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13433) The standalone server should limit the count of cores and memory for running Drivers

2016-02-22 Thread lichenglin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15157118#comment-15157118
 ] 

lichenglin commented on SPARK-13433:


But when? When something else frees up resources?

If all the cores are used by running drivers, the apps will never get any core, so 
no driver will ever finish and free its resources.

> The standalone   server should limit the count of cores and memory for 
> running Drivers
> --
>
> Key: SPARK-13433
> URL: https://issues.apache.org/jira/browse/SPARK-13433
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler
>Affects Versions: 1.6.0
>Reporter: lichenglin
>
> I have a 16-core cluster.
> A running driver uses at least 1 core, maybe more.
> When I submit a lot of jobs to the standalone server in cluster mode,
> all the cores may end up being used by running drivers,
> and then there are no cores left to run applications.
> The server is stuck.
> So I think we should limit the resources (cores and memory) used by running drivers.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13433) The standalone server should limit the count of cores and memory for running Drivers

2016-02-22 Thread lichenglin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15157050#comment-15157050
 ] 

lichenglin commented on SPARK-13433:


It's something like a deadlock.

Drivers use all the cores -> applications get no cores and cannot finish -> the drivers won't 
finish -> the drivers keep their cores -> drivers keep using all the cores.

If the applications can never finish, the drivers can never finish either.

I mean that we should not let the drivers use all the cores,
so that the applications always have at least one core to do their job.




> The standalone   server should limit the count of cores and memory for 
> running Drivers
> --
>
> Key: SPARK-13433
> URL: https://issues.apache.org/jira/browse/SPARK-13433
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler
>Affects Versions: 1.6.0
>Reporter: lichenglin
>
> I have a 16-core cluster.
> A running driver uses at least 1 core, maybe more.
> When I submit a lot of jobs to the standalone server in cluster mode,
> all the cores may end up being used by running drivers,
> and then there are no cores left to run applications.
> The server is stuck.
> So I think we should limit the resources (cores and memory) used by running drivers.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13433) The standalone server should limit the count of cores and memory for running Drivers

2016-02-22 Thread lichenglin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13433?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lichenglin updated SPARK-13433:
---
Description: 
I have a 16-core cluster.
A running driver uses at least 1 core, maybe more.
When I submit a lot of jobs to the standalone server in cluster mode,
all the cores may end up being used by running drivers,
and then there are no cores left to run applications.
The server is stuck.
So I think we should limit the resources (cores and memory) used by running drivers.



  was:
I have a 16-core cluster.
When I submit a lot of jobs to the standalone server in cluster mode,
a running driver uses at least 1 core, maybe more.
So all the cores may end up being used by running drivers,
and then there are no cores left to run applications.
The server is stuck.
So I think we should limit the resources used by running drivers.




> The standalone   server should limit the count of cores and memory for 
> running Drivers
> --
>
> Key: SPARK-13433
> URL: https://issues.apache.org/jira/browse/SPARK-13433
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler
>Affects Versions: 1.6.0
>Reporter: lichenglin
>
> I have a 16-core cluster.
> A running driver uses at least 1 core, maybe more.
> When I submit a lot of jobs to the standalone server in cluster mode,
> all the cores may end up being used by running drivers,
> and then there are no cores left to run applications.
> The server is stuck.
> So I think we should limit the resources (cores and memory) used by running drivers.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13433) The standalone server should limit the count of cores and memory for running Drivers

2016-02-22 Thread lichenglin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13433?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lichenglin updated SPARK-13433:
---
Description: 
I have a 16-core cluster.
When I submit a lot of jobs to the standalone server in cluster mode,
a running driver uses at least 1 core, maybe more.
So all the cores may end up being used by running drivers,
and then there are no cores left to run applications.
The server is stuck.
So I think we should limit the resources used by running drivers.



  was:
I have a 16-core cluster.
When I submit a lot of jobs to the standalone server in cluster mode,
a running driver uses at least 1 core, maybe more.
So all the cores may end up being used by running drivers,
and then there are no cores left to run applications.
The server is stuck.
So I think we should limit the cores used by running drivers.



Summary: The standalone   server should limit the count of cores and 
memory for running Drivers  (was: The standalone   should limit the count of 
Running Drivers)

> The standalone   server should limit the count of cores and memory for 
> running Drivers
> --
>
> Key: SPARK-13433
> URL: https://issues.apache.org/jira/browse/SPARK-13433
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler
>Affects Versions: 1.6.0
>Reporter: lichenglin
>
> I have a 16-core cluster.
> When I submit a lot of jobs to the standalone server in cluster mode,
> a running driver uses at least 1 core, maybe more.
> So all the cores may end up being used by running drivers,
> and then there are no cores left to run applications.
> The server is stuck.
> So I think we should limit the resources used by running drivers.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-13433) The standalone should limit the count of Running Drivers

2016-02-22 Thread lichenglin (JIRA)
lichenglin created SPARK-13433:
--

 Summary: The standalone   should limit the count of Running Drivers
 Key: SPARK-13433
 URL: https://issues.apache.org/jira/browse/SPARK-13433
 Project: Spark
  Issue Type: Improvement
  Components: Scheduler
Affects Versions: 1.6.0
Reporter: lichenglin


I have a 16-core cluster.
When I submit a lot of jobs to the standalone server in cluster mode,
a running driver uses at least 1 core, maybe more.
So all the cores may end up being used by running drivers,
and then there are no cores left to run applications.
The server is stuck.
So I think we should limit the cores used by running drivers.





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12963) In cluster mode,spark_local_ip will cause driver exception:Service 'Driver' failed after 16 retries!

2016-01-28 Thread lichenglin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15121065#comment-15121065
 ] 

lichenglin commented on SPARK-12963:


I think the driver's IP should be fixed, like 'localhost',
and not depend on any property.

> In cluster mode,spark_local_ip will cause driver exception:Service 'Driver' 
> failed  after 16 retries!
> -
>
> Key: SPARK-12963
> URL: https://issues.apache.org/jira/browse/SPARK-12963
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy
>Affects Versions: 1.6.0
>Reporter: lichenglin
>Priority: Critical
>
> I have a 3-node cluster: namenode, second and data1.
> I use this shell command to submit a job on namenode:
> bin/spark-submit   --deploy-mode cluster --class com.bjdv.spark.job.Abc  
> --total-executor-cores 5  --master spark://namenode:6066
> hdfs://namenode:9000/sparkjars/spark.jar
> The driver may be started on another node, such as data1.
> The problem is:
> when I set SPARK_LOCAL_IP in conf/spark-env.sh on namenode,
> the driver is started with that setting, e.g. 
> SPARK_LOCAL_IP=namenode,
> but the driver may start on data1,
> where it will try to bind the IP 'namenode'.
> So the driver throws an exception like this:
>  Service 'Driver' failed after 16 retries!



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12963) In cluster mode,spark_local_ip will cause driver exception:Service 'Driver' failed after 16 retries!

2016-01-28 Thread lichenglin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15121053#comment-15121053
 ] 

lichenglin commented on SPARK-12963:


I think the driver's IP should be fixed, like 'localhost',
and not depend on any property.

> In cluster mode,spark_local_ip will cause driver exception:Service 'Driver' 
> failed  after 16 retries!
> -
>
> Key: SPARK-12963
> URL: https://issues.apache.org/jira/browse/SPARK-12963
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy
>Affects Versions: 1.6.0
>Reporter: lichenglin
>Priority: Critical
>
> I have a 3-node cluster: namenode, second and data1.
> I use this shell command to submit a job on namenode:
> bin/spark-submit   --deploy-mode cluster --class com.bjdv.spark.job.Abc  
> --total-executor-cores 5  --master spark://namenode:6066
> hdfs://namenode:9000/sparkjars/spark.jar
> The driver may be started on another node, such as data1.
> The problem is:
> when I set SPARK_LOCAL_IP in conf/spark-env.sh on namenode,
> the driver is started with that setting, e.g. 
> SPARK_LOCAL_IP=namenode,
> but the driver may start on data1,
> where it will try to bind the IP 'namenode'.
> So the driver throws an exception like this:
>  Service 'Driver' failed after 16 retries!



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12963) In cluster mode,spark_local_ip will cause driver exception:Service 'Driver' failed after 16 retries!

2016-01-21 Thread lichenglin (JIRA)
lichenglin created SPARK-12963:
--

 Summary: In cluster mode,spark_local_ip will cause driver 
exception:Service 'Driver' failed  after 16 retries!
 Key: SPARK-12963
 URL: https://issues.apache.org/jira/browse/SPARK-12963
 Project: Spark
  Issue Type: Bug
  Components: Deploy
Affects Versions: 1.6.0
Reporter: lichenglin
Priority: Critical


I have a 3-node cluster: namenode, second and data1.
I use this shell command to submit a job on namenode:
bin/spark-submit   --deploy-mode cluster --class com.bjdv.spark.job.Abc  
--total-executor-cores 5  --master spark://namenode:6066
hdfs://namenode:9000/sparkjars/spark.jar

The driver may be started on another node, such as data1.
The problem is:
when I set SPARK_LOCAL_IP in conf/spark-env.sh on namenode,
the driver is started with that setting, e.g. 
SPARK_LOCAL_IP=namenode,
but the driver may start on data1,
where it will try to bind the IP 'namenode'.
So the driver throws an exception like this:
 Service 'Driver' failed after 16 retries!






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org