[ https://issues.apache.org/jira/browse/HBASE-7697?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Nick Dimiduk updated HBASE-7697:
Description:
The user experience for importing data into HBase and getting a dump out of
HBase is pretty poor. The existing tools as I understand them include:
- org.apache.hadoop.hbase.mapreduce.Export,
- org.apache.hadoop.hbase.mapreduce.Import,
- org.apache.hadoop.hbase.mapreduce.ImportTsv,
- org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles, and
- org.apache.hadoop.hbase.mapreduce.CopyTable
Each one provides specific features that do not necessarily overlap with the
others. For instance, Import and ImportTsv could have most of their logic
combined, sharing common driver code and leaving the details of the file-format
up to the user to provide via a pluggable mapper. Export and CopyTable both map
over a target table; it's only the detail of what they do with the data that is
different. Bulk operations via HFiles could be a more common use-case as well,
not just a special case of ImportTsv.
The list of [open
issues|https://issues.apache.org/jira/issues/?filter=-1jql=project%20%3D%20HBASE%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened%2C%20%22Patch%20Available%22)%20AND%20text%20~%20%22ImportTsv%22%20ORDER%20BY%20updatedDate%20DESC]
against ImportTsv alone indicates users are using the tool, and I certainly
advise it for people getting started with a new HBase deployment.
I propose a single interface for getting data into and out of HBase. It would
be pluggable, allowing users to override details of their file formats and
schemas. We can provide implementations that replicate existing tool behaviors
as example modules. These tools are also a reasonable place, IMHO, to include
support for creation and loading of snapshots.
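To make the "shared driver, pluggable format" idea concrete, here is a minimal sketch in plain Java with no HBase or MapReduce dependencies. Every name in it (PluggableImportSketch, Cell, LineParser, runImport, tsvParser) is hypothetical: none of them exist in HBase, and this is not the design of any patch attached to this issue. It only illustrates how the driver loop could stay common while the file-format details are supplied by the user.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only: these types are not part of HBase.
public class PluggableImportSketch {

    /** One cell to write: row key, "family:qualifier" column, and value. */
    static final class Cell {
        final String row;
        final String column;
        final String value;

        Cell(String row, String column, String value) {
            this.row = row;
            this.column = column;
            this.value = value;
        }

        @Override
        public String toString() {
            return row + "/" + column + "=" + value;
        }
    }

    /** The pluggable part: turns one input line into zero or more cells. */
    interface LineParser {
        List<Cell> parse(String line);
    }

    /** The shared part: driver logic that is the same for every format. */
    static List<Cell> runImport(Iterable<String> lines, LineParser parser) {
        List<Cell> out = new ArrayList<>();
        for (String line : lines) {
            if (line.isEmpty()) {
                continue; // skip blank lines
            }
            out.addAll(parser.parse(line));
        }
        return out;
    }

    /** Example plug-in: tab-separated lines whose first field is the row key. */
    static LineParser tsvParser(String... columns) {
        return line -> {
            String[] fields = line.split("\t", -1);
            List<Cell> cells = new ArrayList<>();
            // Pair each configured column with the field that follows the row key.
            for (int i = 0; i < columns.length && i + 1 < fields.length; i++) {
                cells.add(new Cell(fields[0], columns[i], fields[i + 1]));
            }
            return cells;
        };
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList("row1\ta\tb", "", "row2\tc\td");
        for (Cell c : runImport(input, tsvParser("cf:x", "cf:y"))) {
            System.out.println(c);
        }
    }
}
```

In this sketch, supporting another file format would mean writing another LineParser and leaving runImport untouched. A real tool would presumably emit HBase mutations or HFiles from a MapReduce job instead of collecting cells in a list.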
I started down the path of a specific tool intended to overcome some of the
limitations of ImportTsv, and it has since been refactored into a more
general-purpose application. Initial patches forthcoming. Comments strongly encouraged.
was:
The user experience for importing data into HBase and getting a dump out of
HBase is pretty poor. The existing tools as I understand them include:
- org.apache.hadoop.hbase.mapreduce.Export,
- org.apache.hadoop.hbase.mapreduce.Import,
- org.apache.hadoop.hbase.mapreduce.ImportTsv,
- org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles, and
- org.apache.hadoop.hbase.mapreduce.CopyTable
Each one provides specific features that doen't necessarily overlap with the
others. For instance, Import and ImportTsv could have most of their logic
combined, sharing common driver code and leaving the details of the file-format
up to the user to provide via a pluggable mapper. Export and CopyTable both map
over a target table; it's only the detail of what they do with the data that is
different. Bulk operations via HFiles could be a more common use-case as well,
not just a special case of ImportTsv.
The list of [open
issues|https://issues.apache.org/jira/issues/?filter=-1jql=project%20%3D%20HBASE%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened%2C%20%22Patch%20Available%22)%20AND%20text%20~%20%22ImportTsv%22%20ORDER%20BY%20updatedDate%20DESC]
against ImportTsv alone indicates users are using the tools, and I certainly
advise it for people getting started with a new HBase deployment.
I propose a single interface for getting data into and out of HBase. It would
be pluggable, allowing users to override details of their file formats and
schemas. We can provide implementations that replicate existing tool behaviors
as example modules. These tools are also a reasonable place, IMHO, to include
support for creation and loading of snapshots.
I started down the path of a specific tool intended to overcome some of the
limitations of ImportTsv and it has since refactored into a more general
purpose application. Initial patches forthcoming. Comments strongly encourages.
Consolidate tools for getting data into, out of HBase
Key: HBASE-7697
URL: https://issues.apache.org/jira/browse/HBASE-7697
Project: HBase
Issue Type: Improvement
Components: Client
Reporter: Nick Dimiduk
Assignee: Nick Dimiduk
The user experience for importing data into HBase and getting a dump out of
HBase is pretty poor. The existing tools as I understand them include:
- org.apache.hadoop.hbase.mapreduce.Export,
- org.apache.hadoop.hbase.mapreduce.Import,
- org.apache.hadoop.hbase.mapreduce.ImportTsv,
- org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles, and
- org.apache.hadoop.hbase.mapreduce.CopyTable
Each one provides specific features that do not necessarily overlap with the
others. For instance, Import and ImportTsv could have most of their logic
combined,