Hi,
I have compiled the latest version of CarbonData so that it is compatible with
HDP 2.6. I am going through the following steps, but no data ever ends up in
the table.
Start Spark Shell:
/home/ubuntu/carbondata# spark-shell --jars /home/ubuntu/carbondata/carbondata_2.11-1.2.0-SNAPSHOT-shade-hadoop2.7.2.jar
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/___/ .__/\_,_/_/ /_/\_\ version 2.1.0.2.6.0.3-8
/_/
Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java
1.8.0_121)
Type in expressions to have them evaluated.
Type :help for more information.
scala> import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.SparkSession
scala> import org.apache.spark.sql.CarbonSession._
import org.apache.spark.sql.CarbonSession._
scala> val carbon = SparkSession.builder().config(sc.getConf).getOrCreateCarbonSession("/test/carbondata/", "/test/carbondata/")
17/07/26 14:58:42 WARN SparkContext: Using an existing SparkContext; some
configuration may not take effect.
17/07/26 14:58:42 WARN CarbonProperties: main The enable unsafe sort value
"null" is invalid. Using the default value "false
17/07/26 14:58:42 WARN CarbonProperties: main The custom block distribution
value "null" is invalid. Using the default value "false
17/07/26 14:58:42 WARN CarbonProperties: main The enable vector reader
value "null" is invalid. Using the default value "true
17/07/26 14:58:42 WARN CarbonProperties: main The value "null" configured
for key carbon.lock.type" is invalid. Using the default value "HDFSLOCK
carbon: org.apache.spark.sql.SparkSession = org.apache.spark.sql.
CarbonSession@5f7bd970
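I assume these CarbonProperties warnings only mean that no carbon.properties file was picked up, so the defaults are used. If it is relevant, my understanding is that the same keys can also be set programmatically before building the session; a minimal sketch, using the key name taken from the warning above (carbon.lock.type), so treat the exact call as an assumption on my side:
scala> import org.apache.carbondata.core.util.CarbonProperties
scala> // set the HDFS-based lock explicitly instead of relying on the default
scala> CarbonProperties.getInstance().addProperty("carbon.lock.type", "HDFSLOCK")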
scala> carbon.sql("CREATE TABLE IF NOT EXISTS test_carbon(id string, name
string, city string,age Int) STORED BY 'carbondata'")
17/07/26 15:04:35 AUDIT CreateTable:
[gateway-dc1r04n01][hdfs][Thread-1]Creating
Table with Database name [default] and Table name [test_carbon]
17/07/26 15:04:36 WARN HiveExternalCatalog: Couldn't find corresponding
Hive SerDe for data source provider org.apache.spark.sql.CarbonSource.
Persisting data source table `default`.`test_carbon` into Hive metastore in
Spark SQL specific format, which is NOT compatible with Hive.
17/07/26 15:04:36 AUDIT CreateTable: [gateway-dc1][hdfs][Thread-1]Table
created with Database name [default] and Table name [test_carbon]
res7: org.apache.spark.sql.DataFrame = []
scala> carbon.sql("describe test_carbon").show()
+--------+---------+-------+
|col_name|data_type|comment|
+--------+---------+-------+
| id| string| null|
| name| string| null|
| city| string| null|
| age| int| null|
+--------+---------+-------+
scala> carbon.sql("INSERT INTO test_carbon VALUES(1,'x1','x2',34)")
17/07/26 15:07:25 AUDIT CarbonDataRDDFactory$:
[gateway-dc1][hdfs][Thread-1]Data
load request has been received for table default.test_carbon
17/07/26 15:07:25 WARN CarbonDataProcessorUtil: main sort scope is set to
LOCAL_SORT
17/07/26 15:07:25 WARN CarbonDataProcessorUtil: Executor task launch worker
for task 5 sort scope is set to LOCAL_SORT
17/07/26 15:07:25 WARN CarbonDataProcessorUtil: Executor task launch worker
for task 5 batch sort size is set to 0
17/07/26 15:07:25 WARN CarbonDataProcessorUtil: Executor task launch worker
for task 5 sort scope is set to LOCAL_SORT
17/07/26 15:07:25 WARN CarbonDataProcessorUtil: Executor task launch worker
for task 5 sort scope is set to LOCAL_SORT
17/07/26 15:07:25 AUDIT CarbonDataRDDFactory$:
[gateway-dc1r04n01][hdfs][Thread-1]Data
load is successful for default.test_carbon
res11: org.apache.spark.sql.DataFrame = []
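If I read the data management DDL correctly, the segments created by a load or insert can be listed directly; this is just a sketch of the check I would run to confirm the insert actually produced a segment:
scala> carbon.sql("SHOW SEGMENTS FOR TABLE test_carbon").show()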
scala> carbon.sql("LOAD DATA INPATH 'hdfs://xxxx/test/carbondata/sample.csv'
INTO TABLE test_carbon")
17/07/26 14:59:28 AUDIT CarbonDataRDDFactory$:
[gateway-dc1][hdfs][Thread-1]Data
load request has been received for table default.test_table
17/07/26 14:59:28 WARN CarbonDataProcessorUtil: main sort scope is set to
LOCAL_SORT
17/07/26 14:59:28 WARN CarbonDataProcessorUtil: [Executor task launch
worker for task
0][partitionID:default_test_table_8662d5ff-9392-4e23-b37e-9a4485f71f0e]
sort scope is set to LOCAL_SORT
17/07/26 14:59:28 WARN CarbonDataProcessorUtil: [Executor task launch
worker for task
0][partitionID:default_test_table_8662d5ff-9392-4e23-b37e-9a4485f71f0e]
batch sort size is set to 0
17/07/26 14:59:28 WARN CarbonDataProcessorUtil: [Executor task launch
worker for task
0][partitionID:default_test_table_8662d5ff-9392-4e23-b37e-9a4485f71f0e]
sort scope is set to LOCAL_SORT
17/07/26 14:59:28 WARN CarbonDataProcessorUtil: [Executor task launch
worker for task
0][partitionID:default_test_table_8662d5ff-9392-4e23-b37e-9a4485f71f0e]
sort scope is set to LOCAL_SORT
17/07/26 14:59:29 AUDIT CarbonDataRDDFactory$:
[gateway-dc1][hdfs][Thread-1]Data
load is successful for default.test_table
res1: org.apache.spark.sql.DataFrame = []
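If it helps, this is the variant of the load I would try next, with the options spelled out explicitly; the option names come from my reading of the load DDL and assume sample.csv has no header row, so please correct me if they are wrong:
scala> carbon.sql("LOAD DATA INPATH 'hdfs://xxxx/test/carbondata/sample.csv' INTO TABLE test_carbon OPTIONS('DELIMITER'=',', 'FILEHEADER'='id,name,city,age')")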
scala> carbon.sql("Select * from test_carbon").show()
java.io.FileNotFoundException: File
/test/carbondata/default/test_table/Fact/Part0/Segment_0
does not exist.
at org.apache.hadoop.hdfs.DistributedFileSystem$DirListingIterator.<init>(
DistributedFileSystem.java:1081)
at org.apache.hadoop.hdfs.DistributedFileSystem$DirListingIterator.<init>(
DistributedFileSystem.java:1059)
at org.apache.hadoop.hdfs.DistributedFileSystem$23.
doCall(DistributedFileSystem.java:1004)
at org.apache.hadoop.hdfs.DistributedFileSystem$23.
doCall(DistributedFileSystem.java:1000)
at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(
FileSystemLinkResolver.java:81)
at org.apache.hadoop.hdfs.DistributedFileSystem.listLocatedStatus(
DistributedFileSystem.java:1000)
at org.apache.hadoop.fs.FileSystem.listLocatedStatus(FileSystem.java:1735)
at org.apache.carbondata.hadoop.CarbonInputFormat.getFileStatusInternal(
CarbonInputFormat.java:862)
at org.apache.carbondata.hadoop.CarbonInputFormat.getFileStatus(
CarbonInputFormat.java:845)
at org.apache.carbondata.hadoop.CarbonInputFormat.listStatus(
CarbonInputFormat.java:802)
at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.
getSplits(FileInputFormat.java:387)
at org.apache.carbondata.hadoop.CarbonInputFormat.getSplitsInternal(
CarbonInputFormat.java:319)
at org.apache.carbondata.hadoop.CarbonInputFormat.getTableBlockInfo(
CarbonInputFormat.java:523)
at org.apache.carbondata.hadoop.CarbonInputFormat.
getSegmentAbstractIndexs(CarbonInputFormat.java:616)
at org.apache.carbondata.hadoop.CarbonInputFormat.getDataBlocksOfSegment(
CarbonInputFormat.java:441)
at org.apache.carbondata.hadoop.CarbonInputFormat.getSplits(
CarbonInputFormat.java:379)
at org.apache.carbondata.hadoop.CarbonInputFormat.getSplits(
CarbonInputFormat.java:302)
at org.apache.carbondata.spark.rdd.CarbonScanRDD.
getPartitions(CarbonScanRDD.scala:81)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:250)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:250)
at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(
MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:250)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:250)
at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(
MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:250)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:250)
at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:
311)
at org.apache.spark.sql.execution.CollectLimitExec.
executeCollect(limit.scala:38)
at org.apache.spark.sql.Dataset$$anonfun$org$apache$spark$sql$
Dataset$$execute$1$1.apply(Dataset.scala:2378)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(
SQLExecution.scala:57)
at org.apache.spark.sql.Dataset.withNewExecutionId(Dataset.scala:2780)
at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$
execute$1(Dataset.scala:2377)
at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$
collect(Dataset.scala:2384)
at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2120)
at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2119)
at org.apache.spark.sql.Dataset.withTypedCallback(Dataset.scala:2810)
at org.apache.spark.sql.Dataset.head(Dataset.scala:2119)
at org.apache.spark.sql.Dataset.take(Dataset.scala:2334)
at org.apache.spark.sql.Dataset.showString(Dataset.scala:248)
at org.apache.spark.sql.Dataset.show(Dataset.scala:638)
at org.apache.spark.sql.Dataset.show(Dataset.scala:597)
at org.apache.spark.sql.Dataset.show(Dataset.scala:606)
... 50 elided
I have checked the folder on HDFS and the directory structure
/test/carbondata/default/test_carbon/ exists, but it is empty.
I'm pretty sure I'm missing something silly, but I have not been able to find
a way to get data into the table.
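In case it is relevant, the same check can be done from the shell with the standard Hadoop FileSystem API; a minimal sketch, where the path is just the store location I passed to getOrCreateCarbonSession plus the database and table name:
scala> import org.apache.hadoop.fs.{FileSystem, Path}
scala> val fs = FileSystem.get(sc.hadoopConfiguration)
scala> // list whatever exists under the table's store directory
scala> fs.listStatus(new Path("/test/carbondata/default/test_carbon")).foreach(s => println(s.getPath))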
On another subject, I am also trying to access this table through Presto, but
there the error is always: Query 20170726_145207_00005_ytsnk failed: line 1:1:
Schema 'default' does not exist
I am also a little bit lost here, as from Spark it seems that the tables are
created in the Hive metastore, but the Presto plugin does not seem to refer
to it.
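For reference, my understanding of the Presto side is that the carbondata connector reads the store location from its own catalog properties file rather than from the Hive metastore, so the catalog would have to point at the same store path used above. Something roughly like this, where the property names are my assumption from the Presto integration docs:
etc/catalog/carbondata.properties:
connector.name=carbondata
carbondata-store=hdfs://xxxx/test/carbondata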
Thanks for reading!
AG