Gabriel, Pig is the way to go!
Thanks for the help.

Ralph

__________________________________________________
Ralph Perko
Pacific Northwest National Laboratory


On 10/9/14, 11:57 PM, "Gabriel Reid" <[email protected]> wrote:

>Hi Ralph,
>
>Inlined below.
>
>> The data is arriving every 15min in multiple text files. This is not a
>> real-time system so some waiting is acceptable. We have a 40-node
>> cluster.
>>
>> I am investigating the Pig option. I have not used Pig much - it appears
>> the table must pre-exist in Phoenix/HBase before loading the data using
>> Pig? Do I need to create two schemas - one for Pig and the other for
>> Phoenix? Are there any other examples of using Pig with Phoenix aside
>> from what is online and in the source tree?
>
>The table must indeed pre-exist in Phoenix before using the Pig
>loader, but I don't think that should be a problem for you (unless you
>want to create a new table for every import).
>
>Pig can simply load from text files and write to Phoenix, so there's
>no need to have a separate Pig schema. Probably the easiest way to
>start is to just write a small Pig script that starts from a text file
>(i.e. doesn't touch Phoenix) and writes to a text file in the form
>that you could use for loading with the CsvBulkLoadTool. Once you're
>up to that point, you can just replace the STORE call in your
>text-to-text Pig script with a STORE call to PhoenixHBaseStorage, and
>things should "just work".
>
>> Writing MR is not an issue; this is my normal mode of ingesting data. Is
>> latency your primary concern with extending the CsvBulkLoadTool? Would
>> writing directly to the JDBC driver be a better approach in MR?
>
>The main reason I see not to extend the CsvBulkLoadTool is that it's
>more work than using Pig, probably both in the short term and in the
>long term. Depending on how reliable the sources of data are that
>you're dealing with, there may be a need (now or in the future) to
>intercept invalid records and write them elsewhere, or similar things.
>Things like that are certainly possible in raw MapReduce, but they're
>much easier with Pig. Using the Phoenix Pig plugin essentially works
>the same as writing to the JDBC driver in MR, but means you just need
>to create a Pig script of somewhere around 5 lines of code, as opposed
>to the work involved in writing and packaging an MR ingester.
>
>I typically do quite a bit of ETL stuff with raw MapReduce and/or
>Crunch [1], but for things like what you're describing, I'll always go
>for Pig if possible.
>
>If you find that Pig won't cut it for your requirements, then writing
>via JDBC in a MapReduce job should be fine. Ravi Magham has been
>working on some stuff [2] to make writing to Phoenix via MR easier,
>and although it's not in Phoenix yet, it should provide you with some
>inspiration.
>
>- Gabriel
>
>1. http://crunch.apache.org
>2. https://github.com/apache/phoenix/pull/22
>
>> Lots of questions - thanks for your time.
>>
>> Ralph
>>
>> __________________________________________________
>> Ralph Perko
>> Pacific Northwest National Laboratory
>>
>> On 10/9/14, 11:17 AM, "Gabriel Reid" <[email protected]> wrote:
>>
>>>Hi Ralph,
>>>
>>>I think this depends a bit on how quickly you want to get the data
>>>into Phoenix after it arrives, what kind of infrastructure you've got
>>>available to run MR jobs, and how the data is actually arriving.
>>>
>>>In general, the highest-throughput and least-flexible option is the
>>>CsvBulkLoadTool. Of your transformation requirements above, it'll be
>>>able to take care of #1 and #4, but parsing fields and combining
>>>fields won't be covered by it. The main reason for the relatively high
>>>throughput of the CsvBulkLoadTool is that it writes directly to HFiles
>>>-- however, it is pretty high latency, and it would probably be best
>>>used to do one or two big loads each day instead of every 15 minutes.
>>>
>>>Two other options that are probably worth considering are Pig and
>>>Flume. I believe Pig should be able to provide enough transformation
>>>logic for what you need, and then you can plug it into the Phoenix
>>>StoreFunc for Pig [1]. Although this won't give you the same
>>>throughput as the CsvBulkLoadTool, it'll be more flexible, as well as
>>>probably having slightly lower overall latency because there is no
>>>reduce phase involved.
>>>
>>>I don't think that Flume itself provides much in the way of
>>>transformation logic, but with Kite Morphlines [2] you can plug in
>>>some transformation logic within Flume, and then send the data through
>>>to the Phoenix Flume plugin [3]. I haven't got much experience with
>>>Flume, but I believe that this should work in theory.
>>>
>>>In any case, I would suggest trying to go with Pig first, and Flume
>>>second. A custom solution will mean you'll need to worry about
>>>scaling/parallelization to get high enough throughput, and both Pig
>>>and Flume are made more for what you're looking for.
>>>
>>>Extending the CsvBulkLoadTool would also be an option, but I would
>>>recommend using that as a last resort (if you can't get high enough
>>>throughput with the other options).
>>>
>>>- Gabriel
>>>
>>>[1] http://phoenix.apache.org/pig_integration.html
>>>[2] http://kitesdk.org/docs/current/kite-morphlines/index.html
>>>[3] http://phoenix.apache.org/flume.html
>>>
>>>On Thu, Oct 9, 2014 at 4:36 PM, Perko, Ralph J <[email protected]>
>>>wrote:
>>>> Hi, What is the best way to ingest large amounts of CSV data coming
>>>> in at regular intervals (about every 15 min, for a total of about
>>>> 500G/daily or 1.5B records/daily) that requires a few transformations
>>>> before being inserted?
>>>>
>>>> By transformation I mean the following:
>>>> 1) one field is converted to a timestamp
>>>> 2) one field is parsed to create a new field
>>>> 3) several fields are combined into one
>>>> 4) a couple of columns need to be reordered
>>>>
>>>> Is there any way to make these transformations through the bulk load
>>>> tool, or is MR the best route? If I use MR, should I go purely through
>>>> JDBC? Write directly to HBase? Do something similar to the CSV bulk
>>>> load tool (perhaps even just customize the CsvBulkLoadTool?), or
>>>> something altogether different?
>>>>
>>>> Thanks!
>>>> Ralph
>>>>
>>>> __________________________________________________
>>>> Ralph Perko
>>>> Pacific Northwest National Laboratory
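
For reference, a minimal sketch of the text-to-text trial script Gabriel describes might look something like the following. The input/output paths, delimiter, field names, date format, and regex below are placeholders rather than details from the thread, and would need to be adapted to the actual feed:

-- Trial run: plain text in, plain text out; Phoenix is not involved yet.
raw = LOAD '/data/incoming/*.txt' USING PigStorage(',')
    AS (event_time:chararray, src:chararray, msg:chararray, val:chararray);

-- The four transformations from the original question:
--   1) convert a field to a timestamp, 2) parse a field into a new one,
--   3) combine several fields, 4) emit columns in the order the table expects.
xformed = FOREACH raw GENERATE
    ToMilliSeconds(ToDate(event_time, 'yyyy-MM-dd HH:mm:ss')) AS event_ts,
    REGEX_EXTRACT(msg, '([^:]+):.*', 1)                       AS msg_type,
    CONCAT(src, CONCAT('|', msg))                             AS src_msg,
    val                                                       AS val;

-- CSV output in a form that could also be fed to the CsvBulkLoadTool.
STORE xformed INTO '/data/staging/csv_out' USING PigStorage(',');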

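The follow-up step Gabriel mentions - swapping the final STORE so the same script writes into Phoenix via PhoenixHBaseStorage - would then look roughly like this, per the Pig integration page [1]. The table name, ZooKeeper quorum, jar path, and batch size are again placeholders:

-- Register the Phoenix client jar so Pig can resolve PhoenixHBaseStorage
-- (jar name/path is a placeholder and depends on the Phoenix version in use).
REGISTER /path/to/phoenix-client.jar;

-- LOAD and FOREACH stay exactly as in the trial script; only the STORE changes.
STORE xformed INTO 'hbase://CSV_DATA'
    USING org.apache.phoenix.pig.PhoenixHBaseStorage('zk-host1,zk-host2,zk-host3', '-batchSize 5000');

The fields of the stored relation need to line up with the columns of the pre-existing Phoenix table (CSV_DATA here is assumed to already have been created), so doing the column reordering in the FOREACH is what covers requirement #4.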