Gabriel, Pig is the way to go!
Thanks for the help.

Ralph

__________________________________________________
Ralph Perko
Pacific Northwest National Laboratory


On 10/9/14, 11:57 PM, "Gabriel Reid" <[email protected]> wrote:

>Hi Ralph,
>
>Inlined below.
>
>> The data is arriving every 15min in multiple text files. This is not a
>> real-time system so some waiting is acceptable. We have a 40-node
>> cluster.
>>
>> I am investigating the Pig option. I have not used Pig much - it appears
>> the table must pre-exist in Phoenix/HBase before loading the data using
>> Pig? Do I need to create two schemas - one for Pig and the other for
>> Phoenix? Are there any other examples of using Pig with Phoenix aside
>> from what is online and in the source tree?
>
>The table must indeed pre-exist in Phoenix before using the Pig
>loader, but I don't think that should be a problem for you (unless you
>want to create a new table for every import).
>
>Pig can simply load from text files and write to Phoenix, so there's
>no need to have a separate Pig schema. Probably the easiest way to
>start is to just write a small Pig script that starts from a text file
>(i.e. doesn't touch Phoenix) and writes to a text file in the form
>that you could use for loading with the CsvBulkLoadTool. Once you're
>up to that point, you can just replace the STORE call in your
>text-to-text Pig script with a STORE call to PhoenixHBaseStorage, and
>things should "just work".
>
>> Writing MR is not an issue; this is my normal mode of ingesting data. Is
>> latency your primary concern with extending the CsvBulkLoadTool? Would
>> writing directly to the JDBC driver be a better approach in MR?
>
>The main reason I see not to extend the CsvBulkLoadTool is that it's
>more work than using Pig, probably both in the short term and in the
>long term. Depending on how reliable the sources of data are that
>you're dealing with, there may be a need (now or in the future) to
>intercept invalid records and write them elsewhere, or similar things.
>Things like that are certainly possible in raw MapReduce, but they're
>much easier with Pig. Using the Phoenix Pig plugin essentially works
>the same as writing to the JDBC driver in MR, but means you just need
>to create a Pig script of somewhere around 5 lines of code, as opposed
>to the work involved in writing and packaging an MR ingester.
>
>I typically do quite a bit of ETL stuff with raw MapReduce and/or
>Crunch [1], but for things like what you're describing, I'll always go
>for Pig if possible.
>
>If you find that Pig won't cut it for your requirements, then writing
>via JDBC in a MapReduce job should be fine. Ravi Magham has been
>working on some stuff [2] to make writing to Phoenix via MR easier,
>and although it's not in Phoenix yet, it should provide you with some
>inspiration.
>
>- Gabriel
>
>1. http://crunch.apache.org
>2. https://github.com/apache/phoenix/pull/22
>
>> Lots of questions - thanks for your time.
>>
>> Ralph
>>
>> __________________________________________________
>> Ralph Perko
>> Pacific Northwest National Laboratory
>>
>> On 10/9/14, 11:17 AM, "Gabriel Reid" <[email protected]> wrote:
>>
>>>Hi Ralph,
>>>
>>>I think this depends a bit on how quickly you want to get the data
>>>into Phoenix after it arrives, what kind of infrastructure you've got
>>>available to run MR jobs, and how the data is actually arriving.
>>>
>>>In general, the highest-throughput and least-flexible option is the
>>>CsvBulkLoadTool. Of your transformation requirements above, it'll be
>>>able to take care of #1 and #4, but parsing fields and combining
>>>fields won't be covered by it. The main reason for the relatively high
>>>throughput of the CsvBulkLoadTool is that it writes directly to HFiles
>>>-- however, it is pretty high latency, and it would probably be best
>>>used to do one or two big loads each day instead of every 15 minutes.
>>>
>>>Two other options that are probably worth considering are Pig and
>>>Flume. I believe Pig should be able to provide enough transformation
>>>logic for what you need, and then you can plug it into the Phoenix
>>>StoreFunc for Pig [1]. Although this won't give you the same
>>>throughput as the CsvBulkLoadTool, it'll be more flexible, as well as
>>>probably having slightly lower overall latency because there is no
>>>reduce phase involved.
>>>
>>>I don't think that Flume itself provides much in the way of
>>>transformation logic, but with Kite Morphlines [2] you can plug in
>>>some transformation logic within Flume, and then send the data through
>>>to the Phoenix Flume plugin [3]. I haven't got much experience with
>>>Flume, but I believe that this should work in theory.
>>>
>>>In any case, I would suggest trying to go with Pig first, and Flume
>>>second. A custom solution will mean you'll need to worry about
>>>scaling/parallelization to get high enough throughput, and both Pig
>>>and Flume are made more for what you're looking for.
>>>
>>>Extending the CsvBulkLoadTool would also be an option, but I would
>>>recommend using that as a last resort (if you can't get high enough
>>>throughput with the other options).
>>>
>>>- Gabriel
>>>
>>>[1] http://phoenix.apache.org/pig_integration.html
>>>[2] http://kitesdk.org/docs/current/kite-morphlines/index.html
>>>[3] http://phoenix.apache.org/flume.html
>>>
>>>On Thu, Oct 9, 2014 at 4:36 PM, Perko, Ralph J <[email protected]>
>>>wrote:
>>>> Hi, What is the best way to ingest large amounts of CSV data coming
>>>> in at regular intervals (about every 15 min, for a total of about
>>>> 500G/daily or 1.5B records/daily) that requires a few transformations
>>>> before being inserted?
>>>>
>>>> By transformation I mean the following:
>>>> 1) one field is converted to a timestamp
>>>> 2) one field is parsed to create a new field
>>>> 3) several fields are combined into one
>>>> 4) a couple of columns need to be reordered
>>>>
>>>> Is there any way to make these transformations through the bulk load
>>>> tool, or is MR the best route? If I use MR, should I go purely through
>>>> JDBC? Write directly to HBase? Do something similar to the CSV bulk
>>>> load tool (perhaps even just customize the CsvBulkLoadTool?), or
>>>> something altogether different?
>>>>
>>>> Thanks!
>>>> Ralph
>>>>
>>>> __________________________________________________
>>>> Ralph Perko
>>>> Pacific Northwest National Laboratory
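
For reference, a minimal sketch of the text-to-text trial script Gabriel describes might look something like the following. The input/output paths, delimiter, field names, date format, and regex below are placeholders rather than details from the thread, and would need to be adapted to the actual feed:

-- Trial run: plain text in, plain text out; Phoenix is not involved yet.
raw = LOAD '/data/incoming/*.txt' USING PigStorage(',')
    AS (event_time:chararray, src:chararray, msg:chararray, val:chararray);

-- The four transformations from the original question:
--   1) convert a field to a timestamp, 2) parse a field into a new one,
--   3) combine several fields, 4) emit columns in the order the table expects.
xformed = FOREACH raw GENERATE
    ToMilliSeconds(ToDate(event_time, 'yyyy-MM-dd HH:mm:ss')) AS event_ts,
    REGEX_EXTRACT(msg, '([^:]+):.*', 1)                       AS msg_type,
    CONCAT(src, CONCAT('|', msg))                             AS src_msg,
    val                                                       AS val;

-- CSV output in a form that could also be fed to the CsvBulkLoadTool.
STORE xformed INTO '/data/staging/csv_out' USING PigStorage(',');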

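The follow-up step Gabriel mentions - swapping the final STORE so the same script writes into Phoenix via PhoenixHBaseStorage - would then look roughly like this, per the Pig integration page [1]. The table name, ZooKeeper quorum, jar path, and batch size are again placeholders:

-- Register the Phoenix client jar so Pig can resolve PhoenixHBaseStorage
-- (jar name/path is a placeholder and depends on the Phoenix version in use).
REGISTER /path/to/phoenix-client.jar;

-- LOAD and FOREACH stay exactly as in the trial script; only the STORE changes.
STORE xformed INTO 'hbase://CSV_DATA'
    USING org.apache.phoenix.pig.PhoenixHBaseStorage('zk-host1,zk-host2,zk-host3', '-batchSize 5000');

The fields of the stored relation need to line up with the columns of the pre-existing Phoenix table (CSV_DATA here is assumed to already have been created), so doing the column reordering in the FOREACH is what covers requirement #4.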