Re: Efficient way to read a large number of files in S3 and upload their content to HBase

2012-05-30 Thread Marcos Ortiz Valmaseda
Like I said before, I need to store all the click streams of an advertising network
for later deep analysis of this huge volume of data.
We want to store the data in two places:
- first in Amazon S3
- then in HBase

But I think that we don't need S3 if we can store everything in a proper HBase cluster
using the asynchbase library from StumbleUpon, and then we can create some
MapReduce jobs for the analysis.
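For example, writing each click straight into HBase with asynchbase could look roughly
like this (the ZooKeeper quorum, table, column family and row-key layout are placeholder
assumptions, not our real schema):

    import org.hbase.async.HBaseClient;
    import org.hbase.async.PutRequest;

    public class AsyncClickWriter {
      public static void main(String[] args) throws Exception {
        // Connect through the ZooKeeper quorum (host name is a placeholder).
        final HBaseClient client = new HBaseClient("zk-host:2181");
        try {
          final byte[] table  = "clicks".getBytes();
          // Example row key only: campaignId|timestamp|userId
          final byte[] key    = "campaign42|20120530T101500|user123".getBytes();
          final byte[] family = "d".getBytes();

          // put() is non-blocking; it returns a Deferred that fires when the RPC completes.
          client.put(new PutRequest(table, key, family,
              "source_url".getBytes(), "http://example.com/landing".getBytes())).join();
          client.put(new PutRequest(table, key, family,
              "user_agent".getBytes(), "Mozilla/5.0".getBytes())).join();
        } finally {
          // shutdown() flushes any buffered RPCs before the client goes away.
          client.shutdown().join();
        }
      }
    }

Since put() returns a Deferred, the click-handling path doesn't have to block on HBase RPCs.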

What do you think?

Regards
- Original message -
From: Ian Varley ivar...@salesforce.com
To: user@hbase.apache.org
Sent: Thu, 24 May 2012 17:12:35 -0400 (CDT)
Subject: Re: Efficient way to read a large number of files in S3 and upload
their content to HBase

This is a question I see coming up a lot. Put differently: what characteristics 
make it useful to use HBase on top of HDFS, as opposed to just flat files in 
HDFS directly? Quantity isn't really an answer, b/c HDFS does fine with 
quantity (better, even).

The basic answers are that HBase is good if:

a) You want to be able to read random small bits of data in the middle of 
(large) HDFS files with low latency (i.e. without loading the whole thing from 
disk)
b) You want to be able to modify (insert) random small bits of data in the 
middle of (immutable, sorted) HDFS files without writing the whole thing out 
again each time.

If all you want is a way to quickly store a lot of data, it's hard to beat 
writing to flat files (only /dev/null is faster, but it doesn't support 
sharding). :) But if you want to then be able to do either (a) or (b) above, 
that's where you start looking at HBase. I assume in your case, you need 
sub-second access to single records (or ranges of records) anywhere in the set?

Ian

Efficient way to read a large number of files in S3 and upload their content to HBase

2012-05-24 Thread Marcos Ortiz

Regards to all the list.
We are using Amazon S3 to store millions of files with a certain format,
and we want to read the content of these files and then upload their
content to an HBase cluster.
Has anyone done this?
Can you recommend an efficient way to do it?

Best wishes.

--
Marcos Luis Ortíz Valmaseda
 Data Engineer & Sr. System Administrator at UCI
 http://marcosluis2186.posterous.com
 http://www.linkedin.com/in/marcosluis2186
 Twitter: @marcosluis2186




Re: Efficient way to read a large number of files in S3 and upload their content to HBase

2012-05-24 Thread Amandeep Khurana
Marcos,

You could do a distcp from S3 to HDFS and then do a bulk import into HBase.
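Roughly, something like this (the distcp path, table name and column layout are only
placeholders; the code is an untested sketch of the HFileOutputFormat / LoadIncrementalHFiles
route, not a drop-in solution):

    // First copy the raw click files from S3 into HDFS, e.g.:
    //   hadoop distcp s3n://my-bucket/click-logs/ /user/ads/click-logs/
    // Then write HFiles with an MR job and bulk-load them:
    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
    import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat;
    import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles;
    import org.apache.hadoop.hbase.util.Bytes;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class ClickBulkLoad {

      // Assumes tab-separated lines: userId, campaignId, sourceUrl, userAgent
      static class ClickMapper
          extends Mapper<LongWritable, Text, ImmutableBytesWritable, Put> {
        @Override
        protected void map(LongWritable offset, Text line, Context ctx)
            throws IOException, InterruptedException {
          String[] f = line.toString().split("\t");
          byte[] row = Bytes.toBytes(f[1] + "|" + f[0]);  // campaignId|userId, example key only
          Put put = new Put(row);
          put.add(Bytes.toBytes("d"), Bytes.toBytes("source_url"), Bytes.toBytes(f[2]));
          put.add(Bytes.toBytes("d"), Bytes.toBytes("user_agent"), Bytes.toBytes(f[3]));
          ctx.write(new ImmutableBytesWritable(row), put);
        }
      }

      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        Job job = new Job(conf, "click-bulkload");
        job.setJarByClass(ClickBulkLoad.class);
        job.setMapperClass(ClickMapper.class);
        job.setMapOutputKeyClass(ImmutableBytesWritable.class);
        job.setMapOutputValueClass(Put.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // the HDFS copy of the S3 files
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // HFile staging directory

        HTable table = new HTable(conf, "clicks");
        // Sets the reducer, partitioner and output format so the job emits sorted HFiles.
        HFileOutputFormat.configureIncrementalLoad(job, table);

        if (job.waitForCompletion(true)) {
          // Moves the generated HFiles into the regions (same idea as the completebulkload tool).
          new LoadIncrementalHFiles(conf).doBulkLoad(new Path(args[1]), table);
        }
      }
    }

The bulk load step hands finished HFiles to the region servers, so the regular write path
(WAL, memstore) is bypassed entirely.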

Are you running HBase on EC2 or on your own hardware?

-Amandeep  


  
  




Re: Efficient way to read a large number of files in S3 and upload their content to HBase

2012-05-24 Thread Marcos Ortiz

Thanks a lot for your answer, Amandeep.

On 05/24/2012 02:55 PM, Amandeep Khurana wrote:

Marcos,

You could do a distcp from S3 to HDFS and then do a bulk import into HBase.
The quantity of files is very large, so we want to combine some files and
then construct the HFile to upload to HBase.
Any example of a custom FileMerger for it?


Are you running HBase on EC2 or on your own hardware?
We have created a small HBase cluster on our own hardware, but we want to
build another cluster on top of Amazon EC2. This could be very good for the
integration between S3 and the HBase cluster.

Regards






--
Marcos Luis Ortíz Valmaseda
 Data Engineer & Sr. System Administrator at UCI
 http://marcosluis2186.posterous.com
 http://www.linkedin.com/in/marcosluis2186
 Twitter: @marcosluis2186




Re: Efficient way to read a large number of files in S3 and upload their content to HBase

2012-05-24 Thread Amandeep Khurana
Marcos

Can you elaborate on your use case a little bit? What is the nature of the
data in S3, and why do you want to use HBase? Why do you want to combine
HFiles and upload them back to S3? It'll help us answer your questions
better.

Amandeep



Re: Efficient way to read a large number of files in S3 and upload their content to HBase

2012-05-24 Thread Marcos Ortiz



On 05/24/2012 03:21 PM, Amandeep Khurana wrote:

Marcos

Can you elaborate on your use case a little bit? What is the nature of the
data in S3, and why do you want to use HBase? Why do you want to combine
HFiles and upload them back to S3? It'll help us answer your questions
better.

Amandeep

Ok, let me explain more.
We are working on an ads optimization platform on top of Hadoop and HBase.
Another team in my organization creates a log file per user click
and stores each file in S3. I discussed with them that a better approach
is to store this workflow log in HBase instead of S3, because that way we
can skip the extra step of reading the file's content from S3, building
the HFile and uploading it to HBase.


The file in S3 contains the basic information for the operation:
- Source URL
- User ID
- User agent
- Campaign ID
and more fields.

So we then want to create MapReduce jobs on top of HBase to do some
calculations and reports over this data.
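
For example, one such job could look roughly like this, counting clicks per campaign
(the table, family and qualifier names are placeholder assumptions):

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
    import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
    import org.apache.hadoop.hbase.mapreduce.TableMapper;
    import org.apache.hadoop.hbase.util.Bytes;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class ClicksPerCampaign {

      static class ClickMapper extends TableMapper<Text, LongWritable> {
        private static final LongWritable ONE = new LongWritable(1);
        @Override
        protected void map(ImmutableBytesWritable row, Result r, Context ctx)
            throws IOException, InterruptedException {
          byte[] campaign = r.getValue(Bytes.toBytes("d"), Bytes.toBytes("campaign_id"));
          if (campaign != null) {
            ctx.write(new Text(Bytes.toString(campaign)), ONE);
          }
        }
      }

      static class SumReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
        @Override
        protected void reduce(Text campaign, Iterable<LongWritable> counts, Context ctx)
            throws IOException, InterruptedException {
          long total = 0;
          for (LongWritable c : counts) total += c.get();
          ctx.write(campaign, new LongWritable(total));
        }
      }

      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        Job job = new Job(conf, "clicks-per-campaign");
        job.setJarByClass(ClicksPerCampaign.class);

        Scan scan = new Scan();
        scan.setCaching(500);        // larger scanner batches for MR throughput
        scan.setCacheBlocks(false);  // a full scan should not churn the block cache

        TableMapReduceUtil.initTableMapperJob("clicks", scan,
            ClickMapper.class, Text.class, LongWritable.class, job);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);
        FileOutputFormat.setOutputPath(job, new Path(args[0]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }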

We are evaluating HBase because our current solution is on top of
PostgreSQL, and the main issue is that when you launch a campaign on the
platform, the rate of INSERTs and UPDATEs to PostgreSQL can jump in a short
time from 1 to 100 clicks per second. In some preliminary tests, the table
where we store the workflow log grew to 350,000 tuples in two days, so it
could become a problem. For that reason, we want to migrate this to HBase.


But I think that generating a file in S3 and then uploading it to HBase is
not the best way to do this, because you can always create the workflow log
entry for every user, build a Put for it and write it to HBase directly;
and to avoid the locks, I'm evaluating the asynchronous API released
by StumbleUpon. [1]

What do you think about this?

[1] https://github.com/stumbleupon/asynchbase






--
Marcos Luis Ortíz Valmaseda
Data Engineer & Sr. System Administrator at UCI
 http://marcosluis2186.posterous.com
 http://www.linkedin.com/in/marcosluis2186
 Twitter: @marcosluis2186




Re: Efficient way to read a large number of files in S3 and upload their content to HBase

2012-05-24 Thread Amandeep Khurana
Thanks for that description. I'm not entirely sure why you want to use HBase
here. You've got logs coming in that you want to process in batch to do
calculations on. This can be done by running MR jobs on the flat files
themselves. You could use Java MR, Hive or Pig to accomplish this. Why do you
want HBase here?

-ak  



Re: Efficient way to read a large number of files in S3 and upload their content to HBase

2012-05-24 Thread Marcos Ortiz



On 05/24/2012 04:47 PM, Amandeep Khurana wrote:
Thanks for that description. I'm not entirely sure why you want to use
HBase here. You've got logs coming in that you want to process in batch
to do calculations on. This can be done by running MR jobs on the flat
files themselves. You could use Java MR, Hive or Pig to accomplish this.
Why do you want HBase here?
The main reason to use HBase is the quantity of rows involved in the
process. It could provide an efficient and quick way to store all of this.
Hive can be an option too.

I will discuss all this again with the dev team.
Thanks a lot for your answers.




--
Marcos Luis Ortíz Valmaseda
Data Engineer & Sr. System Administrator at UCI
http://marcosluis2186.posterous.com

Re: Efficient way to read a large number of files in S3 and upload their content to HBase

2012-05-24 Thread Ian Varley
This is a question I see coming up a lot. Put differently: what characteristics 
make it useful to use HBase on top of HDFS, as opposed to just flat files in 
HDFS directly? Quantity isn't really an answer, b/c HDFS does fine with 
quantity (better, even). 

The basic answers are that HBase is good if:

a) You want to be able to read random small bits of data in the middle of 
(large) HDFS files with low latency (i.e. without loading the whole thing from 
disk)
b) You want to be able to modify (insert) random small bits of data in the 
middle of (immutable, sorted) HDFS files without writing the whole thing out 
again each time.

If all you want is a way to quickly store a lot of data, it's hard to beat 
writing to flat files (only /dev/null is faster, but it doesn't support 
sharding). :) But if you want to then be able to do either (a) or (b) above, 
that's where you start looking at HBase. I assume in your case, you need 
sub-second access to single records (or ranges of records) anywhere in the set?
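
For a concrete idea of (a), a single-record read by row key looks roughly like this
(the table name, column family and key below are just placeholders):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.util.Bytes;

    public class RandomRead {
      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "clicks");  // table name is an example
        try {
          // A point lookup touches only the region (and HFile blocks) holding this row,
          // instead of streaming whole files off HDFS the way a flat-file scan would.
          Get get = new Get(Bytes.toBytes("campaign42|user123"));
          Result result = table.get(get);
          byte[] url = result.getValue(Bytes.toBytes("d"), Bytes.toBytes("source_url"));
          System.out.println(url == null ? "not found" : Bytes.toString(url));
        } finally {
          table.close();
        }
      }
    }

Compare that with having to scan the flat files in HDFS just to find one user's clicks.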

Ian


Re: Efficient way to read a large number of files in S3 and upload their content to HBase

2012-05-24 Thread Marcos Ortiz



On 05/24/2012 05:12 PM, Ian Varley wrote:

This is a question I see coming up a lot. Put differently: what characteristics make it 
useful to use HBase on top of HDFS, as opposed to just flat files in HDFS directly? 
Quantity isn't really an answer, b/c HDFS does fine with quantity (better, 
even).

The basic answers are that HBase is good if:

a) You want to be able to read random small bits of data in the middle of 
(large) HDFS files with low latency (i.e. without loading the whole thing from 
disk)

Yes.

b) You want to be able to modify (insert) random small bits of data in the 
middle of (immutable, sorted) HDFS files without writing the whole thing out 
again each time.

and Yes again.


If all you want is a way to quickly store a lot of data, it's hard to beat 
writing to flat files (only /dev/null is faster, but it doesn't support 
sharding). :) But if you want to then be able to do either (a) or (b) above, 
that's where you start looking at HBase. I assume in your case, you need 
sub-second access to single records (or ranges of records) anywhere in the set?

We want to use HBase because we think it is a perfect fit for our requirements.


