Understanding reported times on the Spark UI [+ Streaming]

2014-12-08 Thread Gerard Maas
Hi,

I'm confused about the stage times reported on the Spark UI (Spark 1.1.0)
for a Spark Streaming job. I'm hoping somebody can shed some light on it:

Let's do this with an example:

On the /stages page, stage #232 is reported to have lasted 18 s:

  232  runJob at RDDFunctions.scala:23 (+details)
  Submitted: 2014/12/08 15:06:25   Duration: 18 s   Tasks: 12/12
  http://localhost:24040/stages/stage?id=232&attempt=0
When I click on it for details, I see: [1]

Total time across all tasks = 42s

Aggregated metrics by executor:
Executor1 19s
Executor2 24s

Summing the individual task times actually gives 40,009 s.

What does the time reported on the overview page represent (the 18 s)?

What is the relation between the time reported on the overview page and the
detail page?

My Spark Streaming job is reported to be taking 3m24s, and (I think)
there's only 1 stage in my job. How does the per-stage timing relate to the
times reported on the Spark Streaming 'streaming' page (e.g. 'Last batch')?

Is there a way to relate a streaming batch to the stages executed to
complete that batch?
The numbers as they are at the moment don't seem to add up.
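
For concreteness, this is the kind of side-by-side view I'm after. A rough
sketch of my own (using the SparkListener / StreamingListener developer APIs;
I haven't verified this is the right way to line the two up):

  import org.apache.spark.scheduler.{SparkListener, SparkListenerStageCompleted}
  import org.apache.spark.streaming.scheduler.{StreamingListener, StreamingListenerBatchCompleted}

  // log each completed stage with its wall-clock duration
  sc.addSparkListener(new SparkListener {
    override def onStageCompleted(stage: SparkListenerStageCompleted): Unit = {
      val info = stage.stageInfo
      val durationMs =
        for (s <- info.submissionTime; c <- info.completionTime) yield c - s
      println(s"stage ${info.stageId} '${info.name}' took ${durationMs.getOrElse(-1L)} ms")
    }
  })

  // log each completed batch with its processing time
  ssc.addStreamingListener(new StreamingListener {
    override def onBatchCompleted(batch: StreamingListenerBatchCompleted): Unit = {
      val info = batch.batchInfo
      println(s"batch ${info.batchTime} processed in ${info.processingDelay.getOrElse(-1L)} ms")
    }
  })

(sc / ssc are my SparkContext and StreamingContext.)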

Thanks,

Gerard.


[1] https://drive.google.com/file/d/0BznIWnuWhoLlMkZubzY2dTdOWDQ


Re: Hive Problem in Pig generated Parquet file schema in CREATE EXTERNAL TABLE (e.g. bag::col1)

2014-12-08 Thread Michael Armbrust
This is by Hive's design. From the Hive documentation:

The column change command will only modify Hive's metadata, and will not
 modify data. Users should make sure the actual data layout of the
 table/partition conforms with the metadata definition.
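
Concretely, my reading of why the rename produces NULLs (assuming the Parquet
columns are resolved by name, which is my understanding rather than something
spelled out in the docs):

  // the ALTER only rewrites the metastore schema; the Parquet files still
  // contain a field literally named "sorted::cre_ts"
  sql("alter table pmt change `sorted::cre_ts` cre_ts string")

  // the reader now looks for a Parquet column named "cre_ts", finds none,
  // and returns NULL for every row
  sql("select cre_ts from pmt limit 1").collect()  // Array([null])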



On Sat, Dec 6, 2014 at 8:28 PM, Jianshi Huang jianshi.hu...@gmail.com
wrote:

 Ok, found another possible bug in Hive.

 My current solution is to use ALTER TABLE CHANGE to rename the column
 names.

 The problem is that after renaming the columns, the values of those columns
 all became NULL.

 Before renaming:
 scala> sql("select `sorted::cre_ts` from pmt limit 1").collect
 res12: Array[org.apache.spark.sql.Row] = Array([12/02/2014 07:38:54])

 Execute renaming:
 scala> sql("alter table pmt change `sorted::cre_ts` cre_ts string")
 res13: org.apache.spark.sql.SchemaRDD =
 SchemaRDD[972] at RDD at SchemaRDD.scala:108
 == Query Plan ==
 Native command: executed by Hive

 After renaming:
 scala> sql("select cre_ts from pmt limit 1").collect
 res16: Array[org.apache.spark.sql.Row] = Array([null])

 I created a JIRA for it:

   https://issues.apache.org/jira/browse/SPARK-4781


 Jianshi

 On Sun, Dec 7, 2014 at 1:06 AM, Jianshi Huang jianshi.hu...@gmail.com
 wrote:

 Hmm... another issue I found with this approach is that ANALYZE TABLE
 ... COMPUTE STATISTICS fails to attach the metadata to the table, and
 later broadcast joins and such will fail...

 Any idea how to fix this issue?

 Jianshi

 On Sat, Dec 6, 2014 at 9:10 PM, Jianshi Huang jianshi.hu...@gmail.com
 wrote:

 Very interesting: the line doing the drop table throws an exception.
 After removing it, everything works.

 Jianshi

 On Sat, Dec 6, 2014 at 9:11 AM, Jianshi Huang jianshi.hu...@gmail.com
 wrote:

 Here's the solution I got after talking with Liancheng:

 1) use backquotes `..` to wrap all the illegal characters

 val rdd = parquetFile(file)
 val schema = rdd.schema.fields.map(f =>
   s"`${f.name}` ${HiveMetastoreTypes.toMetastoreType(f.dataType)}").mkString(",\n")

 val ddl_13 = s"""
   |CREATE EXTERNAL TABLE $name (
   |  $schema
   |)
   |STORED AS PARQUET
   |LOCATION '$file'
 """.stripMargin

 sql(ddl_13)

 2) create a new schema and use applySchema to generate a new SchemaRDD;
 I had to drop and re-register the table

 val t = table(name)
 val newSchema = StructType(t.schema.fields.map(s =>
   s.copy(name = s.name.replaceAll(".*?::", ""))))
 sql(s"drop table $name")
 applySchema(t, newSchema).registerTempTable(name)

 I'm testing it for now.

 Thanks for the help!


 Jianshi

 On Sat, Dec 6, 2014 at 8:41 AM, Jianshi Huang jianshi.hu...@gmail.com
 wrote:

 Hi,

 I had to use Pig for some preprocessing and to generate Parquet files
 for Spark to consume.

 However, due to a Pig limitation, the generated schema contains Pig's
 identifiers

 e.g.
 sorted::id, sorted::cre_ts, ...

 I tried to put the schema inside CREATE EXTERNAL TABLE, e.g.

   create external table pmt (
     sorted::id bigint
   )
   stored as parquet
   location '...'

 Obviously it didn't work. I also tried removing the sorted:: identifier,
 but the resulting rows contained only nulls.

 Any idea how to create a table in HiveContext from these Parquet files?

 Thanks,
 Jianshi
 --
 Jianshi Huang

 LinkedIn: jianshi
 Twitter: @jshuang
 Github & Blog: http://huangjs.github.com/




 --
 Jianshi Huang

 LinkedIn: jianshi
 Twitter: @jshuang
 Github & Blog: http://huangjs.github.com/




 --
 Jianshi Huang

 LinkedIn: jianshi
 Twitter: @jshuang
 Github & Blog: http://huangjs.github.com/




 --
 Jianshi Huang

 LinkedIn: jianshi
 Twitter: @jshuang
 Github & Blog: http://huangjs.github.com/




 --
 Jianshi Huang

 LinkedIn: jianshi
 Twitter: @jshuang
 Github & Blog: http://huangjs.github.com/



Re: Handling stale PRs

2014-12-08 Thread Nicholas Chammas
I recently came across this blog post, which reminded me of this thread.

How to Discourage Open Source Contributions
http://danluu.com/discourage-oss/

We are currently at 320+ open PRs, many of which haven't been updated in
over a month. We have quite a few PRs that haven't been touched in 3-5
months.

*If you have the time and interest, please hop on over to the Spark PR
Dashboard https://spark-prs.appspot.com/, sort the PRs by
least-recently-updated, and update them where you can.*

I share the blog author's opinion that letting PRs go stale discourages
contributions, especially from first-time contributors, and especially more
so when the PR author is waiting on feedback from a committer or
contributor.

I've been thinking about simple ways to make it easier for all of us to
chip in on controlling stale PRs in an incremental way. For starters, would
it help if an automated email went out to the dev list once a week that a)
reported the number of stale PRs, and b) directly linked to the 5 least
recently updated PRs?
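
To make that concrete, here's a rough sketch of the kind of query such a
report could run (against the public GitHub REST API; the PR dashboard
already tracks all of this, so this is purely illustrative and untested):

  import scala.io.Source

  object StalePrReport {
    def main(args: Array[String]): Unit = {
      // the 5 least-recently-updated open PRs for apache/spark
      val url = "https://api.github.com/repos/apache/spark/pulls" +
        "?state=open&sort=updated&direction=asc&per_page=5"
      // a real weekly job would authenticate (unauthenticated calls are
      // rate-limited) and parse the JSON; here we just dump the response
      println(Source.fromURL(url).mkString)
    }
  }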

Nick

On Sat Aug 30 2014 at 3:41:39 AM Nicholas Chammas 
nicholas.cham...@gmail.com wrote:

 On Tue, Aug 26, 2014 at 2:02 AM, Patrick Wendell pwend...@gmail.com
 wrote:

 it's actually procedurally difficult for us to close pull requests


 Just an FYI: Seems like the GitHub-sanctioned work-around to having
 issues-only permissions is to have a second, issues-only repository
 https://help.github.com/articles/issues-only-access-permissions. Not a
 very attractive work-around...

 Nick



Re: Handling stale PRs

2014-12-08 Thread Ganelin, Ilya
Thank you for pointing this out, Nick. I know that for myself and my
colleague who are starting to contribute to Spark, it's definitely
discouraging to have fixes sitting in the pipeline. Could you recommend
any other ways that we can facilitate getting these PRs accepted? Clean,
well-tested code is an obvious one, but I'd like to know if there are some
non-obvious things we (as contributors) could do to make the committers'
lives easier. Thanks!

-Ilya 

On 12/8/14, 11:58 AM, Nicholas Chammas nicholas.cham...@gmail.com
wrote:

I recently came across this blog post, which reminded me of this thread.

How to Discourage Open Source Contributions
http://danluu.com/discourage-oss/

We are currently at 320+ open PRs, many of which haven't been updated in
over a month. We have quite a few PRs that haven't been touched in 3-5
months.

*If you have the time and interest, please hop on over to the Spark PR
Dashboard https://spark-prs.appspot.com/, sort the PRs by
least-recently-updated, and update them where you can.*

I share the blog author's opinion that letting PRs go stale discourages
contributions, especially from first-time contributors, and especially
more
so when the PR author is waiting on feedback from a committer or
contributor.

I've been thinking about simple ways to make it easier for all of us to
chip in on controlling stale PRs in an incremental way. For starters,
would
it help if an automated email went out to the dev list once a week that a)
reported the number of stale PRs, and b) directly linked to the 5 least
recently updated PRs?

Nick

On Sat Aug 30 2014 at 3:41:39 AM Nicholas Chammas 
nicholas.cham...@gmail.com wrote:

 On Tue, Aug 26, 2014 at 2:02 AM, Patrick Wendell pwend...@gmail.com
 wrote:

it's actually procedurally difficult for us to close pull requests


 Just an FYI: Seems like the GitHub-sanctioned work-around to having
 issues-only permissions is to have a second, issues-only repository
 https://help.github.com/articles/issues-only-access-permissions. Not a
 very attractive work-around...

 Nick







Re: Handling stale PRs

2014-12-08 Thread Nicholas Chammas
Things that help:

   - Be persistent. People are busy, so just ping them if there’s been no
   response for a couple of weeks. Hopefully, as the project continues to
   develop, this will become less necessary.
   - Only ping reviewers after test results are back from Jenkins. Make
   sure all the tests are clear before reaching out, unless you need help
   understanding why a test is failing.
   - Whenever possible, keep PRs small, small, small.
   - Get buy-in on the dev list before working on something, especially
   larger features, to make sure you are making something that people
   understand and that is in accordance with Spark’s design.

I’m just speaking as a random contributor here, so don’t take this advice
as gospel.

Nick

On Mon Dec 08 2014 at 3:08:02 PM Ganelin, Ilya ilya.gane...@capitalone.com
wrote:

 Thank you for pointing this out, Nick. I know that for myself and my
 colleague who are starting to contribute to Spark, it's definitely
 discouraging to have fixes sitting in the pipeline. Could you recommend
 any other ways that we can facilitate getting these PRs accepted? Clean,
 well-tested code is an obvious one, but I'd like to know if there are some
 non-obvious things we (as contributors) could do to make the committers'
 lives easier. Thanks!

 -Ilya

 On 12/8/14, 11:58 AM, Nicholas Chammas nicholas.cham...@gmail.com
 wrote:

 I recently came across this blog post, which reminded me of this thread.
 
 How to Discourage Open Source Contributions
 http://danluu.com/discourage-oss/
 
 We are currently at 320+ open PRs, many of which haven't been updated in
 over a month. We have quite a few PRs that haven't been touched in 3-5
 months.
 
 *If you have the time and interest, please hop on over to the Spark PR
 Dashboard https://spark-prs.appspot.com/, sort the PRs by
 least-recently-updated, and update them where you can.*
 
 I share the blog author's opinion that letting PRs go stale discourages
 contributions, especially from first-time contributors, and especially
 more
 so when the PR author is waiting on feedback from a committer or
 contributor.
 
 I've been thinking about simple ways to make it easier for all of us to
 chip in on controlling stale PRs in an incremental way. For starters,
 would
 it help if an automated email went out to the dev list once a week that a)
 reported the number of stale PRs, and b) directly linked to the 5 least
 recently updated PRs?
 
 Nick
 
 On Sat Aug 30 2014 at 3:41:39 AM Nicholas Chammas 
 nicholas.cham...@gmail.com wrote:
 
  On Tue, Aug 26, 2014 at 2:02 AM, Patrick Wendell pwend...@gmail.com
  wrote:
 
  it's actually procedurally difficult for us to close pull requests
 
 
  Just an FYI: Seems like the GitHub-sanctioned work-around to having
  issues-only permissions is to have a second, issues-only repository
  https://help.github.com/articles/issues-only-access-permissions. Not
 a
  very attractive work-around...
 
  Nick
 

 





Re: scala.MatchError on SparkSQL when creating ArrayType of StructType

2014-12-08 Thread Yin Huai
Seems you hit https://issues.apache.org/jira/browse/SPARK-4245. It was
fixed in 1.2.
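
Until you can upgrade, a possible workaround based on your own observation
below (the cast succeeds when the nested fields are declared nullable) is to
build the schema with nullable = true and enforce the constraint yourself,
e.g. (sketch only, mirroring your schema):

  val scmNullable = StructType(inputRDD.schema.fields.init :+
    StructField("list",
      ArrayType(
        StructType(Seq(
          StructField("date", StringType, nullable = true),
          StructField("nbPurchase", IntegerType, nullable = true)))),
      nullable = false))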

Thanks,

Yin

On Wed, Dec 3, 2014 at 11:50 AM, invkrh inv...@gmail.com wrote:

 Hi,

 I am using Spark SQL on the 1.1.0 branch.

 The following code leads to a scala.MatchError at
 org.apache.spark.sql.catalyst.expressions.Cast.cast$lzycompute(Cast.scala:247):

 val scm = StructType(inputRDD.schema.fields.init :+
   StructField("list",
     ArrayType(
       StructType(Seq(
         StructField("date", StringType, nullable = false),
         StructField("nbPurchase", IntegerType, nullable = false)))),
     nullable = false))

 // purchaseRDD is an RDD[sql.Row] whose schema corresponds to scm. It is
 // transformed from inputRDD.
 val schemaRDD = hiveContext.applySchema(purchaseRDD, scm)
 schemaRDD.registerTempTable("t_purchase")

 Here's the stack trace:
 scala.MatchError: ArrayType(StructType(List(StructField(date,StringType,true),
 StructField(n_reachat,IntegerType,true))),true) (of class
 org.apache.spark.sql.catalyst.types.ArrayType)
 at

 org.apache.spark.sql.catalyst.expressions.Cast.cast$lzycompute(Cast.scala:247)
 at
 org.apache.spark.sql.catalyst.expressions.Cast.cast(Cast.scala:247)
 at
 org.apache.spark.sql.catalyst.expressions.Cast.eval(Cast.scala:263)
 at

 org.apache.spark.sql.catalyst.expressions.Alias.eval(namedExpressions.scala:84)
 at

 org.apache.spark.sql.catalyst.expressions.InterpretedMutableProjection.apply(Projection.scala:66)
 at

 org.apache.spark.sql.catalyst.expressions.InterpretedMutableProjection.apply(Projection.scala:50)
 at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
 at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
 at
 org.apache.spark.sql.hive.execution.InsertIntoHiveTable.org
 $apache$spark$sql$hive$execution$InsertIntoHiveTable$$writeToFile$1(InsertIntoHiveTable.scala:149)
 at

 org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$saveAsHiveFile$1.apply(InsertIntoHiveTable.scala:158)
 at

 org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$saveAsHiveFile$1.apply(InsertIntoHiveTable.scala:158)
 at
 org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
 at org.apache.spark.scheduler.Task.run(Task.scala:54)
 at
 org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
 at

 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 at

 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 at java.lang.Thread.run(Thread.java:744)

 The strange thing is that nullable of the date and nbPurchase fields is
 set to true in the error while it was false in the code. If I set both to
 true, it works. But, in fact, they should not be nullable.

 Here's what I found at Cast.scala:247 on the 1.1.0 branch:

   private[this] lazy val cast: Any => Any = dataType match {
     case StringType => castToString
     case BinaryType => castToBinary
     case DecimalType => castToDecimal
     case TimestampType => castToTimestamp
     case BooleanType => castToBoolean
     case ByteType => castToByte
     case ShortType => castToShort
     case IntegerType => castToInt
     case FloatType => castToFloat
     case LongType => castToLong
     case DoubleType => castToDouble
   }

 Any idea? Thank you.

 Hao







Re: CREATE TABLE AS SELECT does not work with temp tables in 1.2.0

2014-12-08 Thread Michael Armbrust
This is merged now and should be fixed in the next 1.2 RC.

On Sat, Dec 6, 2014 at 8:28 PM, Cheng, Hao hao.ch...@intel.com wrote:

 I've created (reused) the PR https://github.com/apache/spark/pull/3336;
 hopefully we can fix this regression.

 Thanks for the report.

 Cheng Hao

 -Original Message-
 From: Michael Armbrust [mailto:mich...@databricks.com]
 Sent: Saturday, December 6, 2014 4:51 AM
 To: kb
 Cc: d...@spark.incubator.apache.org; Cheng Hao
 Subject: Re: CREATE TABLE AS SELECT does not work with temp tables in 1.2.0

 Thanks for reporting.  This looks like a regression related to:
 https://github.com/apache/spark/pull/2570

 I've filed it here: https://issues.apache.org/jira/browse/SPARK-4769

 On Fri, Dec 5, 2014 at 12:03 PM, kb kend...@hotmail.com wrote:

  I am having trouble getting CREATE TABLE AS SELECT or saveAsTable from a
  HiveContext to work with temp tables in Spark 1.2. No issues in 1.1.0 or
  1.1.1.
 
  A simple modification to a test case in the Hive SQLQuerySuite.scala:
 
   test("double nested data") {
     sparkContext.parallelize(Nested1(Nested2(Nested3(1))) :: Nil)
       .registerTempTable("nested")
     checkAnswer(
       sql("SELECT f1.f2.f3 FROM nested"),
       1)
     checkAnswer(
       sql("CREATE TABLE test_ctas_1234 AS SELECT * from nested"),
       Seq.empty[Row])
     checkAnswer(
       sql("SELECT * FROM test_ctas_1234"),
       sql("SELECT * FROM nested").collect().toSeq)
   }
 
 
  output:
 
  11:57:15.974 ERROR org.apache.hadoop.hive.ql.parse.SemanticAnalyzer:
  org.apache.hadoop.hive.ql.parse.SemanticException: Line 1:45 Table not
  found 'nested'
  at
 
 
 org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.getMetaData(SemanticAnalyzer.java:1243)
  at
 
 
 org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.getMetaData(SemanticAnalyzer.java:1192)
  at
 
 
 org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.analyzeInternal(SemanticAnalyzer.java:9209)
  at
 
 
 org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze(BaseSemanticAnalyzer.java:327)
  at
 
 
 org.apache.spark.sql.hive.execution.CreateTableAsSelect.metastoreRelation$lzycompute(CreateTableAsSelect.scala:59)
  at
 
 
 org.apache.spark.sql.hive.execution.CreateTableAsSelect.metastoreRelation(CreateTableAsSelect.scala:55)
  at
 
 
 org.apache.spark.sql.hive.execution.CreateTableAsSelect.sideEffectResult$lzycompute(CreateTableAsSelect.scala:82)
  at
 
 
 org.apache.spark.sql.hive.execution.CreateTableAsSelect.sideEffectResult(CreateTableAsSelect.scala:70)
  at
 
 
 org.apache.spark.sql.hive.execution.CreateTableAsSelect.execute(CreateTableAsSelect.scala:89)
  at
 
 
 org.apache.spark.sql.SQLContext$QueryExecution.toRdd$lzycompute(SQLContext.scala:425)
  at
 
 org.apache.spark.sql.SQLContext$QueryExecution.toRdd(SQLContext.scala:425)
  at
  org.apache.spark.sql.SchemaRDDLike$class.$init$(SchemaRDDLike.scala:58)
  at org.apache.spark.sql.SchemaRDD.init(SchemaRDD.scala:105)
  at
 org.apache.spark.sql.hive.HiveContext.sql(HiveContext.scala:103)
  at
 
 
 org.apache.spark.sql.hive.execution.SQLQuerySuite$$anonfun$4.apply$mcV$sp(SQLQuerySuite.scala:122)
  at
 
 
 org.apache.spark.sql.hive.execution.SQLQuerySuite$$anonfun$4.apply(SQLQuerySuite.scala:117)
  at
 
 
 org.apache.spark.sql.hive.execution.SQLQuerySuite$$anonfun$4.apply(SQLQuerySuite.scala:117)
  at
 
 
 org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)
  at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
  at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
  at org.scalatest.Transformer.apply(Transformer.scala:22)
  at org.scalatest.Transformer.apply(Transformer.scala:20)
  at
 org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166)
  at org.scalatest.Suite$class.withFixture(Suite.scala:1122)
  at org.scalatest.FunSuite.withFixture(FunSuite.scala:1555)
  at
 
 
 org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163)
  at
 
 org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
  at
 
 org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
  at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
  at
 org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175)
  at org.scalatest.FunSuite.runTest(FunSuite.scala:1555)
  at
 
 
 org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
  at
 
 
 org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
  at
 
 
 org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:413)
  at
 
 
 org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:401)
  at scala.collection.immutable.List.foreach(List.scala:318)
  at 

Re: Hive Problem in Pig generated Parquet file schema in CREATE EXTERNAL TABLE (e.g. bag::col1)

2014-12-08 Thread Jianshi Huang
Ah... I see. Thanks for pointing it out.

Then it means we cannot mount an external table using customized column
names. Hmm...

Then the only option left is to use a subquery to add a bunch of column
aliases. I'll try it later.
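
Something along these lines (untested; "pmt_clean" is just an example name):

  // alias the Pig-prefixed columns in a subquery and register the result
  // under a clean temp table name
  sql("""
    SELECT `sorted::id`     AS id,
           `sorted::cre_ts` AS cre_ts
    FROM pmt
  """).registerTempTable("pmt_clean")

  sql("select cre_ts from pmt_clean limit 1").collect()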

Thanks,
Jianshi

On Tue, Dec 9, 2014 at 3:34 AM, Michael Armbrust mich...@databricks.com
wrote:

 This is by Hive's design. From the Hive documentation:

 The column change command will only modify Hive's metadata, and will not
 modify data. Users should make sure the actual data layout of the
 table/partition conforms with the metadata definition.



 On Sat, Dec 6, 2014 at 8:28 PM, Jianshi Huang jianshi.hu...@gmail.com
 wrote:

 Ok, found another possible bug in Hive.

 My current solution is to use ALTER TABLE CHANGE to rename the column
 names.

 The problem is that after renaming the columns, the values of those columns
 all became NULL.

 Before renaming:
 scala> sql("select `sorted::cre_ts` from pmt limit 1").collect
 res12: Array[org.apache.spark.sql.Row] = Array([12/02/2014 07:38:54])

 Execute renaming:
 scala> sql("alter table pmt change `sorted::cre_ts` cre_ts string")
 res13: org.apache.spark.sql.SchemaRDD =
 SchemaRDD[972] at RDD at SchemaRDD.scala:108
 == Query Plan ==
 Native command: executed by Hive

 After renaming:
 scala> sql("select cre_ts from pmt limit 1").collect
 res16: Array[org.apache.spark.sql.Row] = Array([null])

 I created a JIRA for it:

   https://issues.apache.org/jira/browse/SPARK-4781


 Jianshi

 On Sun, Dec 7, 2014 at 1:06 AM, Jianshi Huang jianshi.hu...@gmail.com
 wrote:

 Hmm... another issue I found with this approach is that ANALYZE TABLE
 ... COMPUTE STATISTICS fails to attach the metadata to the table, and
 later broadcast joins and such will fail...

 Any idea how to fix this issue?

 Jianshi

 On Sat, Dec 6, 2014 at 9:10 PM, Jianshi Huang jianshi.hu...@gmail.com
 wrote:

 Very interesting: the line doing the drop table throws an exception.
 After removing it, everything works.

 Jianshi

 On Sat, Dec 6, 2014 at 9:11 AM, Jianshi Huang jianshi.hu...@gmail.com
 wrote:

 Here's the solution I got after talking with Liancheng:

 1) use backquotes `..` to wrap all the illegal characters

 val rdd = parquetFile(file)
 val schema = rdd.schema.fields.map(f =>
   s"`${f.name}` ${HiveMetastoreTypes.toMetastoreType(f.dataType)}").mkString(",\n")

 val ddl_13 = s"""
   |CREATE EXTERNAL TABLE $name (
   |  $schema
   |)
   |STORED AS PARQUET
   |LOCATION '$file'
 """.stripMargin

 sql(ddl_13)

 2) create a new schema and use applySchema to generate a new SchemaRDD;
 I had to drop and re-register the table

 val t = table(name)
 val newSchema = StructType(t.schema.fields.map(s =>
   s.copy(name = s.name.replaceAll(".*?::", ""))))
 sql(s"drop table $name")
 applySchema(t, newSchema).registerTempTable(name)

 I'm testing it for now.

 Thanks for the help!


 Jianshi

 On Sat, Dec 6, 2014 at 8:41 AM, Jianshi Huang jianshi.hu...@gmail.com
  wrote:

 Hi,

 I had to use Pig for some preprocessing and to generate Parquet files
 for Spark to consume.

 However, due to a Pig limitation, the generated schema contains Pig's
 identifiers

 e.g.
 sorted::id, sorted::cre_ts, ...

 I tried to put the schema inside CREATE EXTERNAL TABLE, e.g.

   create external table pmt (
     sorted::id bigint
   )
   stored as parquet
   location '...'

 Obviously it didn't work. I also tried removing the sorted:: identifier,
 but the resulting rows contained only nulls.

 Any idea how to create a table in HiveContext from these Parquet
 files?

 Thanks,
 Jianshi
 --
 Jianshi Huang

 LinkedIn: jianshi
 Twitter: @jshuang
 Github & Blog: http://huangjs.github.com/




 --
 Jianshi Huang

 LinkedIn: jianshi
 Twitter: @jshuang
 Github & Blog: http://huangjs.github.com/




 --
 Jianshi Huang

 LinkedIn: jianshi
 Twitter: @jshuang
 Github & Blog: http://huangjs.github.com/




 --
 Jianshi Huang

 LinkedIn: jianshi
 Twitter: @jshuang
 Github & Blog: http://huangjs.github.com/




 --
 Jianshi Huang

 LinkedIn: jianshi
 Twitter: @jshuang
 Github & Blog: http://huangjs.github.com/





-- 
Jianshi Huang

LinkedIn: jianshi
Twitter: @jshuang
Github & Blog: http://huangjs.github.com/