[jira] [Updated] (HBASE-7697) Consolidate tools for getting data into, out of HBase

2013-01-28 Thread Nick Dimiduk (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-7697?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Dimiduk updated HBASE-7697:


Description: 
The user experience for importing data into HBase and getting a dump out of 
HBase is pretty poor. The existing tools as I understand them include:
- org.apache.hadoop.hbase.mapreduce.Export,
- org.apache.hadoop.hbase.mapreduce.Import,
- org.apache.hadoop.hbase.mapreduce.ImportTsv,
- org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles, and
- org.apache.hadoop.hbase.mapreduce.CopyTable

Each one provides specific features that do not necessarily overlap with the 
others. For instance, Import and ImportTsv could have most of their logic 
combined, sharing common driver code and leaving the details of the file-format 
up to the user to provide via a pluggable mapper. Export and CopyTable both map 
over a target table; it's only the detail of what they do with the data that is 
different. Bulk operations via HFiles could be a more common use-case as well, 
not just a special case of ImportTsv.
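
As a rough illustration of the "shared driver, pluggable format" idea above, here is a minimal sketch in plain Java with no HBase or MapReduce dependencies. Every name in it (PluggableImportSketch, LineParser, Row, run) is hypothetical and chosen for illustration only; none of it is taken from HBase or from any patch on this issue. A real tool would presumably do the parsing inside a mapper and emit Puts or HFiles rather than collect rows in a list.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class PluggableImportSketch {

    /** Hypothetical stand-in for a parsed row; not an HBase class. */
    public static final class Row {
        public final String key;
        public final List<String> values;

        public Row(String key, List<String> values) {
            this.key = key;
            this.values = values;
        }

        @Override
        public String toString() {
            return key + " -> " + values;
        }
    }

    /** Hypothetical plug-in point: the user supplies the file-format details. */
    public interface LineParser {
        Row parse(String line);
    }

    /** Splits a line on the given delimiter; the first field is the row key. */
    public static LineParser delimited(final String delimiterRegex) {
        return line -> {
            String[] fields = line.split(delimiterRegex, -1);
            return new Row(fields[0], Arrays.asList(fields).subList(1, fields.length));
        };
    }

    /** Tab-separated input, loosely analogous to what ImportTsv reads. */
    public static final LineParser TSV = delimited("\t");

    /** Comma-separated input, handled by the same driver with a different parser. */
    public static final LineParser CSV = delimited(",");

    /** Shared driver: identical for every format, only the parser varies. */
    public static List<Row> run(List<String> lines, LineParser parser) {
        List<Row> rows = new ArrayList<>();
        for (String line : lines) {
            if (line.isEmpty()) {
                continue; // skip blank lines rather than emit empty rows
            }
            rows.add(parser.parse(line));
        }
        return rows;
    }

    public static void main(String[] args) {
        for (Row row : run(Arrays.asList("row1\ta\tb", "row2\tc\td"), TSV)) {
            System.out.println(row);
        }
        for (Row row : run(Arrays.asList("row3,e,f"), CSV)) {
            System.out.println(row);
        }
    }
}
```

The point of the sketch is only that the driver loop in run never changes; supporting a new input format means supplying another LineParser.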

The list of [open 
issues|https://issues.apache.org/jira/issues/?filter=-1&jql=project%20%3D%20HBASE%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened%2C%20%22Patch%20Available%22)%20AND%20text%20~%20%22ImportTsv%22%20ORDER%20BY%20updatedDate%20DESC]
 against ImportTsv alone indicates users are using the tool, and I certainly 
advise it for people getting started with a new HBase deployment.

I propose a single interface for getting data into and out of HBase. It would 
be pluggable, allowing users to override details of their file formats and 
schemas. We can provide implementations that replicate existing tool behaviors 
as example modules. These tools are also a reasonable place, IMHO, to include 
support for creation and loading of snapshots.

I started down the path of a specific tool intended to overcome some of the 
limitations of ImportTsv, and it has since been refactored into a more 
general-purpose application. Initial patches are forthcoming. Comments are 
strongly encouraged.

  was:
The user experience for importing data into HBase and getting a dump out of 
HBase is pretty poor. The existing tools as I understand them include:
- org.apache.hadoop.hbase.mapreduce.Export,
- org.apache.hadoop.hbase.mapreduce.Import,
- org.apache.hadoop.hbase.mapreduce.ImportTsv,
- org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles, and
- org.apache.hadoop.hbase.mapreduce.CopyTable

Each one provides specific features that doen't necessarily overlap with the 
others. For instance, Import and ImportTsv could have most of their logic 
combined, sharing common driver code and leaving the details of the file-format 
up to the user to provide via a pluggable mapper. Export and CopyTable both map 
over a target table; it's only the detail of what they do with the data that is 
different. Bulk operations via HFiles could be a more common use-case as well, 
not just a special case of ImportTsv.

The list of [open 
issues|https://issues.apache.org/jira/issues/?filter=-1jql=project%20%3D%20HBASE%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened%2C%20%22Patch%20Available%22)%20AND%20text%20~%20%22ImportTsv%22%20ORDER%20BY%20updatedDate%20DESC]
 against ImportTsv alone indicates users are using the tools, and I certainly 
advise it for people getting started with a new HBase deployment.

I propose a single interface for getting data into and out of HBase. It would 
be pluggable, allowing users to override details of their file formats and 
schemas. We can provide implementations that replicate existing tool behaviors 
as example modules. These tools are also a reasonable place, IMHO, to include 
support for creation and loading of snapshots.

I started down the path of a specific tool intended to overcome some of the 
limitations of ImportTsv and it has since refactored into a more general 
purpose application. Initial patches forthcoming. Comments strongly encourages.


 Consolidate tools for getting data into, out of HBase
 -

 Key: HBASE-7697
 URL: https://issues.apache.org/jira/browse/HBASE-7697
 Project: HBase
  Issue Type: Improvement
  Components: Client
Reporter: Nick Dimiduk
Assignee: Nick Dimiduk

[jira] [Updated] (HBASE-7697) Consolidate tools for getting data into, out of HBase

2013-01-28 Thread Jonathan Hsieh (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-7697?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Hsieh updated HBASE-7697:
--

Component/s: mapreduce

 Consolidate tools for getting data into, out of HBase
 -

 Key: HBASE-7697
 URL: https://issues.apache.org/jira/browse/HBASE-7697
 Project: HBase
  Issue Type: Improvement
  Components: Client, mapreduce
Reporter: Nick Dimiduk
Assignee: Nick Dimiduk
