Sorry, my explanation may not have been very clear. Here is the situation: before the import, HDFS disk usage is "DFS Used: 54.19 GB"; after importing the data from HDFS into HBase, it is "DFS Used: 57.16 GB". The source data stored in HDFS is only 69 MB, and the HDFS replication factor is 3.
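For what it's worth, the numbers are at least self-consistent: 57.16 GB - 54.19 GB = 2.97 GB of raw DFS usage, which at a replication factor of 3 is about 0.99 GB of actual HBase data, i.e. roughly a 14x blow-up of the 69 MB of CSV. On the question below about excluding replication: the "DFS Used" figure I quoted includes all three replicas. A minimal sketch of checking the pre-replication size with Hadoop's ContentSummary API (the class name and the "/hbase" path here are just for illustration; substitute your hbase.rootdir):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.ContentSummary;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // Reports the logical size (before replication) vs. the raw bytes
    // consumed on disk for a directory. The "/hbase" path is an assumption.
    public class DfsUsage {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            conf.set("fs.defaultFS", "hdfs://hydra0001:8020"); // same cluster as below
            FileSystem fs = FileSystem.get(conf);

            ContentSummary cs = fs.getContentSummary(new Path("/hbase"));
            // getLength() is pre-replication; getSpaceConsumed() is what the
            // DataNodes actually use, roughly getLength() * replication.
            System.out.println("logical bytes: " + cs.getLength()
                    + ", raw bytes on disk: " + cs.getSpaceConsumed());
        }
    }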
2013/9/9 Shahab Yunus <[email protected]>

> Some quick thoughts. Your size is bound to increase, because recall that
> the row key is stored in every cell. So if your CSV has, let us say, 5
> columns and you import them into HBase using the first column as the key,
> you end up with essentially 9 columns (1 for the row key, and then 2 for
> each of the remaining 4 'rowkey-column' pairs). (I know, a very crude and
> high-level estimation.)
>
> Also, how are you measuring the size in HDFS after the import to HBase?
> Are you excluding any replication of the data?
>
> Regards,
> Shahab
>
>
> On Mon, Sep 9, 2013 at 5:06 AM, kun yan <[email protected]> wrote:
>
> > Hello everyone. I wrote a MapReduce program to import data from HDFS
> > into HBase, but after the import the usage increased a lot: my original
> > data size is 69 MB (HDFS), yet after importing into HBase my HDFS usage
> > grew by about 3 GB. What did I do wrong in the program I wrote?
> >
> > thanks
> >
> > public class MRImportHBaseCsv {
> >
> >     public static void main(String[] args) throws IOException,
> >             InterruptedException, ClassNotFoundException {
> >         Configuration conf = new Configuration();
> >         conf.set("fs.defaultFS", "hdfs://hydra0001:8020");
> >         conf.set("yarn.resourcemanager.address", "hydra0001:8032");
> >         Job job = createSubmitTableJob(conf, args);
> >         job.submit();
> >     }
> >
> >     public static Job createSubmitTableJob(Configuration conf,
> >             String[] args) throws IOException {
> >         String tableName = args[0];
> >         Path inputDir = new Path(args[1]);
> >         Job job = new Job(conf, "HDFS_TO_HBase");
> >         job.setJarByClass(HourlyImporter.class);
> >         FileInputFormat.setInputPaths(job, inputDir);
> >         job.setInputFormatClass(TextInputFormat.class);
> >         job.setMapperClass(HourlyImporter.class);
> >         // ++++ insert into table directly using TableOutputFormat ++++
> >         TableMapReduceUtil.initTableReducerJob(tableName, null, job);
> >         job.setNumReduceTasks(0);
> >         TableMapReduceUtil.addDependencyJars(job);
> >         return job;
> >     }
> >
> >     static class HourlyImporter extends
> >             Mapper<LongWritable, Text, ImmutableBytesWritable, Put> {
> >
> >         private long ts;
> >         // var column family
> >         static byte[] family = Bytes.toBytes("s");
> >         static String columns =
> >                 "HBASE_ROW_KEY,STATION,YEAR,MONTH,DAY,HOUR,MINUTE";
> >
> >         @Override
> >         protected void cleanup(Context context) throws IOException,
> >                 InterruptedException {
> >             ts = System.currentTimeMillis();
> >         }
> >
> >         @Override
> >         protected void map(LongWritable key, Text value, Context context)
> >                 throws IOException, InterruptedException {
> >             ArrayList<String> columnsList = Lists.newArrayList(
> >                     Splitter.on(',').trimResults().split(columns));
> >
> >             String line = value.toString();
> >             ArrayList<String> columnValues = Lists.newArrayList(
> >                     Splitter.on(',').trimResults().split(line));
> >             byte[] bRowKey = Bytes.toBytes(columnValues.get(0));
> >             ImmutableBytesWritable rowKey =
> >                     new ImmutableBytesWritable(bRowKey);
> >
> >             Put p = new Put(Bytes.toBytes(columnValues.get(0)));
> >             for (int i = 1; i < columnValues.size(); i++) {
> >                 p.add(family, Bytes.toBytes(columnsList.get(i)),
> >                         Bytes.toBytes(columnValues.get(i)));
> >             }
> >             context.write(rowKey, p);
> >         }
> >     }
> > }
> >
> > --
> > In the Hadoop world I am just a novice, exploring the entire Hadoop
> > ecosystem. I hope one day I can contribute my own code.
> >
> > YanBit
> > [email protected]
> >
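Shahab's point about the row key being repeated in every cell seems to explain most of the growth: for short values like "2013", the serialized cell is dominated by its key (row key + family + qualifier + timestamp + type), not by the value itself. A quick way to see this with HBase's own KeyValue class; a minimal sketch (the sample row key and value below are made up):

    import org.apache.hadoop.hbase.KeyValue;
    import org.apache.hadoop.hbase.util.Bytes;

    // Per-cell overhead check: every cell carries the row key, family,
    // qualifier, timestamp and type alongside the value, so a 4-byte value
    // like "2013" is dwarfed by its key. Sample data is hypothetical.
    public class CellOverhead {
        public static void main(String[] args) {
            KeyValue kv = new KeyValue(
                    Bytes.toBytes("station42-20130909"), // row key (made up)
                    Bytes.toBytes("s"),                  // family from the job above
                    Bytes.toBytes("YEAR"),               // qualifier
                    System.currentTimeMillis(),          // timestamp
                    Bytes.toBytes("2013"));              // the actual value
            // getValueLength(): just the value bytes; getLength(): the whole
            // serialized cell, including the repeated row key.
            System.out.println("value bytes: " + kv.getValueLength()
                    + ", cell bytes: " + kv.getLength());
        }
    }

With six such cells per CSV row, a row that is a few dozen bytes of CSV can easily become a few hundred bytes in the store files, which is the same order as the ~14x growth above.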
--
In the Hadoop world I am just a novice, exploring the entire Hadoop
ecosystem. I hope one day I can contribute my own code.

YanBit
[email protected]
