Hi,
I have compiled the latest version of CarbonData so that it is compatible with
HDP 2.6. I am going through the following steps, but no data ever ends up in
the table.
Start Spark Shell:
/home/ubuntu/carbondata# spark-shell --jars /home/ubuntu/carbondata/carbondata_2.11-1.2.0-SNAPSHOT-shade-hadoop2.7.2.jar
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/___/ .__/\_,_/_/ /_/\_\ version 2.1.0.2.6.0.3-8
/_/
Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java
1.8.0_121)
Type in expressions to have them evaluated.
Type :help for more information.
scala> import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.SparkSession
scala> import org.apache.spark.sql.CarbonSession._
import org.apache.spark.sql.CarbonSession._
scala> val carbon = SparkSession.builder().config(sc.getConf).getOrCreateCarbonSession("/test/carbondata/", "/test/carbondata/")
17/07/26 14:58:42 WARN SparkContext: Using an existing SparkContext; some
configuration may not take effect.
17/07/26 14:58:42 WARN CarbonProperties: main The enable unsafe sort value
"null" is invalid. Using the default value "false
17/07/26 14:58:42 WARN CarbonProperties: main The custom block distribution
value "null" is invalid. Using the default value "false
17/07/26 14:58:42 WARN CarbonProperties: main The enable vector reader
value "null" is invalid. Using the default value "true
17/07/26 14:58:42 WARN CarbonProperties: main The value "null" configured
for key carbon.lock.type" is invalid. Using the default value "HDFSLOCK
carbon: org.apache.spark.sql.SparkSession = org.apache.spark.sql.
CarbonSession@5f7bd970
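I assume these CarbonProperties warnings only mean that no carbon.properties file was picked up, so the defaults are used. If it is relevant, my understanding is that the same keys can also be set programmatically before building the session; a minimal sketch, using the key name taken from the warning above (carbon.lock.type), so treat the exact call as an assumption on my side:
scala> import org.apache.carbondata.core.util.CarbonProperties
scala> // set the HDFS-based lock explicitly instead of relying on the default
scala> CarbonProperties.getInstance().addProperty("carbon.lock.type", "HDFSLOCK")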
scala> carbon.sql("CREATE TABLE IF NOT EXISTS test_carbon(id string, name
string, city string,age Int) STORED BY 'carbondata'")
17/07/26 15:04:35 AUDIT CreateTable:
[gateway-dc1r04n01][hdfs][Thread-1]Creating
Table with Database name [default] and Table name [test_carbon]
17/07/26 15:04:36 WARN HiveExternalCatalog: Couldn't find corresponding
Hive SerDe for data source provider org.apache.spark.sql.CarbonSource.
Persisting data source table `default`.`test_carbon` into Hive metastore in
Spark SQL specific format, which is NOT compatible with Hive.
17/07/26 15:04:36 AUDIT CreateTable: [gateway-dc1][hdfs][Thread-1]Table
created with Database name [default] and Table name [test_carbon]
res7: org.apache.spark.sql.DataFrame = []
scala> carbon.sql("describe test_carbon").show()
+--------+---------+-------+
|col_name|data_type|comment|
+--------+---------+-------+
| id| string| null|
| name| string| null|
| city| string| null|
| age| int| null|
+--------+---------+-------+
scala> carbon.sql("INSERT INTO test_carbon VALUES(1,'x1','x2',34)")
17/07/26 15:07:25 AUDIT CarbonDataRDDFactory$:
[gateway-dc1][hdfs][Thread-1]Data
load request has been received for table default.test_carbon
17/07/26 15:07:25 WARN CarbonDataProcessorUtil: main sort scope is set to
LOCAL_SORT
17/07/26 15:07:25 WARN CarbonDataProcessorUtil: Executor task launch worker
for task 5 sort scope is set to LOCAL_SORT
17/07/26 15:07:25 WARN CarbonDataProcessorUtil: Executor task launch worker
for task 5 batch sort size is set to 0
17/07/26 15:07:25 WARN CarbonDataProcessorUtil: Executor task launch worker
for task 5 sort scope is set to LOCAL_SORT
17/07/26 15:07:25 WARN CarbonDataProcessorUtil: Executor task launch worker
for task 5 sort scope is set to LOCAL_SORT
17/07/26 15:07:25 AUDIT CarbonDataRDDFactory$:
[gateway-dc1r04n01][hdfs][Thread-1]Data
load is successful for default.test_carbon
res11: org.apache.spark.sql.DataFrame = []
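If I read the data management DDL correctly, the segments created by a load or insert can be listed directly; this is just a sketch of the check I would run to confirm the insert actually produced a segment:
scala> carbon.sql("SHOW SEGMENTS FOR TABLE test_carbon").show()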
scala> carbon.sql("LOAD DATA INPATH 'hdfs://xxxx/test/carbondata/sample.csv'
INTO TABLE test_carbon")
17/07/26 14:59:28 AUDIT CarbonDataRDDFactory$:
[gateway-dc1][hdfs][Thread-1]Data
load request has been received for table default.test_table
17/07/26 14:59:28 WARN CarbonDataProcessorUtil: main sort scope is set to
LOCAL_SORT
17/07/26 14:59:28 WARN CarbonDataProcessorUtil: [Executor task launch
worker for task
0][partitionID:default_test_table_8662d5ff-9392-4e23-b37e-9a4485f71f0e]
sort scope is set to LOCAL_SORT
17/07/26 14:59:28 WARN CarbonDataProcessorUtil: [Executor task launch
worker for task
0][partitionID:default_test_table_8662d5ff-9392-4e23-b37e-9a4485f71f0e]
batch sort size is set to 0
17/07/26 14:59:28 WARN CarbonDataProcessorUtil: [Executor task launch
worker for task
0][partitionID:default_test_table_8662d5ff-9392-4e23-b37e-9a4485f71f0e]
sort scope is set to LOCAL_SORT
17/07/26 14:59:28 WARN CarbonDataProcessorUtil: [Executor task launch
worker for task
0][partitionID:default_test_table_8662d5ff-9392-4e23-b37e-9a4485f71f0e]
sort scope is set to LOCAL_SORT
17/07/26 14:59:29 AUDIT CarbonDataRDDFactory$:
[gateway-dc1][hdfs][Thread-1]Data
load is successful for default.test_table
res1: org.apache.spark.sql.DataFrame = []
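If it helps, this is the variant of the load I would try next, with the options spelled out explicitly; the option names come from my reading of the load DDL and assume sample.csv has no header row, so please correct me if they are wrong:
scala> carbon.sql("LOAD DATA INPATH 'hdfs://xxxx/test/carbondata/sample.csv' INTO TABLE test_carbon OPTIONS('DELIMITER'=',', 'FILEHEADER'='id,name,city,age')")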
scala> carbon.sql("Select * from test_carbon").show()
java.io.FileNotFoundException: File
/test/carbondata/default/test_table/Fact/Part0/Segment_0
does not exist.
at org.apache.hadoop.hdfs.DistributedFileSystem$DirListingIterator.<init>(
DistributedFileSystem.java:1081)
at org.apache.hadoop.hdfs.DistributedFileSystem$DirListingIterator.<init>(
DistributedFileSystem.java:1059)
at org.apache.hadoop.hdfs.DistributedFileSystem$23.
doCall(DistributedFileSystem.java:1004)
at org.apache.hadoop.hdfs.DistributedFileSystem$23.
doCall(DistributedFileSystem.java:1000)
at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(
FileSystemLinkResolver.java:81)
at org.apache.hadoop.hdfs.DistributedFileSystem.listLocatedStatus(
DistributedFileSystem.java:1000)
at org.apache.hadoop.fs.FileSystem.listLocatedStatus(FileSystem.java:1735)
at org.apache.carbondata.hadoop.CarbonInputFormat.getFileStatusInternal(
CarbonInputFormat.java:862)
at org.apache.carbondata.hadoop.CarbonInputFormat.getFileStatus(
CarbonInputFormat.java:845)
at org.apache.carbondata.hadoop.CarbonInputFormat.listStatus(
CarbonInputFormat.java:802)
at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.
getSplits(FileInputFormat.java:387)
at org.apache.carbondata.hadoop.CarbonInputFormat.getSplitsInternal(
CarbonInputFormat.java:319)
at org.apache.carbondata.hadoop.CarbonInputFormat.getTableBlockInfo(
CarbonInputFormat.java:523)
at org.apache.carbondata.hadoop.CarbonInputFormat.
getSegmentAbstractIndexs(CarbonInputFormat.java:616)
at org.apache.carbondata.hadoop.CarbonInputFormat.getDataBlocksOfSegment(
CarbonInputFormat.java:441)
at org.apache.carbondata.hadoop.CarbonInputFormat.getSplits(
CarbonInputFormat.java:379)
at org.apache.carbondata.hadoop.CarbonInputFormat.getSplits(
CarbonInputFormat.java:302)
at org.apache.carbondata.spark.rdd.CarbonScanRDD.
getPartitions(CarbonScanRDD.scala:81)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:250)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:250)
at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(
MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:250)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:250)
at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(
MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:250)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:250)
at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:
311)
at org.apache.spark.sql.execution.CollectLimitExec.
executeCollect(limit.scala:38)
at org.apache.spark.sql.Dataset$$anonfun$org$apache$spark$sql$
Dataset$$execute$1$1.apply(Dataset.scala:2378)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(
SQLExecution.scala:57)
at org.apache.spark.sql.Dataset.withNewExecutionId(Dataset.scala:2780)
at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$
execute$1(Dataset.scala:2377)
at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$
collect(Dataset.scala:2384)
at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2120)
at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2119)
at org.apache.spark.sql.Dataset.withTypedCallback(Dataset.scala:2810)
at org.apache.spark.sql.Dataset.head(Dataset.scala:2119)
at org.apache.spark.sql.Dataset.take(Dataset.scala:2334)
at org.apache.spark.sql.Dataset.showString(Dataset.scala:248)
at org.apache.spark.sql.Dataset.show(Dataset.scala:638)
at org.apache.spark.sql.Dataset.show(Dataset.scala:597)
at org.apache.spark.sql.Dataset.show(Dataset.scala:606)
... 50 elided
I have checked the folder on HDFS and the directory structure
/test/carbondata/default/test_carbon/ exists, but it is empty.
I'm pretty sure I'm missing something silly, but I have not been able to find
a way to get data into the table.
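In case it is relevant, the same check can be done from the shell with the standard Hadoop FileSystem API; a minimal sketch, where the path is just the store location I passed to getOrCreateCarbonSession plus the database and table name:
scala> import org.apache.hadoop.fs.{FileSystem, Path}
scala> val fs = FileSystem.get(sc.hadoopConfiguration)
scala> // list whatever exists under the table's store directory
scala> fs.listStatus(new Path("/test/carbondata/default/test_carbon")).foreach(s => println(s.getPath))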
On another subject, I am also trying to access this table through Presto, but
there the error is always: Query 20170726_145207_00005_ytsnk failed: line 1:1:
Schema 'default' does not exist
I am also a little bit lost here, as from Spark it seems that the tables are
created in the Hive metastore, but the Presto plugin does not seem to refer
to it.
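For reference, my understanding of the Presto side is that the carbondata connector reads the store location from its own catalog properties file rather than from the Hive metastore, so the catalog would have to point at the same store path used above. Something roughly like this, where the property names are my assumption from the Presto integration docs:
etc/catalog/carbondata.properties:
connector.name=carbondata
carbondata-store=hdfs://xxxx/test/carbondata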
Thanks for reading!
AG