Bob,

Going via Scalding sounds like a fine idea as well -- the advantage of using Pig is that you wouldn't need to implement anything custom in terms of JDBC handling (because it already exists), but indeed I would expect that you'll get comparable performance with Scalding.
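
For what it's worth, here is a rough sketch of how the upsert statement could be assembled if you go via JDBC with dynamic columns. It assumes the dynamic column syntax described in the Phoenix documentation (the column name followed by its type inside the column list). The class name DynamicUpsert and the table and column names below are made up for illustration, and nothing here escapes or validates identifiers.

```java
import java.util.List;
import java.util.Map;
import java.util.StringJoiner;

/** Illustrative helper (not part of Phoenix): builds an UPSERT that declares dynamic columns inline. */
final class DynamicUpsert {

    /**
     * @param table       target table name
     * @param fixedCols   columns that are already part of the table schema
     * @param dynamicCols dynamic column name -> SQL type, declared per statement;
     *                    pass an insertion-ordered map so parameter order is predictable
     */
    static String build(String table, List<String> fixedCols, Map<String, String> dynamicCols) {
        StringJoiner cols = new StringJoiner(", ");
        StringJoiner params = new StringJoiner(", ");
        for (String col : fixedCols) {
            cols.add(col);          // schema column: name only
            params.add("?");
        }
        for (Map.Entry<String, String> e : dynamicCols.entrySet()) {
            cols.add(e.getKey() + " " + e.getValue());  // dynamic column: name plus type
            params.add("?");
        }
        return "UPSERT INTO " + table + " (" + cols + ") VALUES (" + params + ")";
    }
}
```

The resulting string would be passed to `Connection.prepareStatement(...)`, with values bound in the same order as the columns. Since each statement fixes the type of every dynamic column, records would probably need to be grouped by their set of dynamic columns and types, with one statement per group.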
If you want to generate HFiles, I would indeed look at extending (or reusing parts of) the CsvBulkLoadTool, which currently creates HFiles. However, I would definitely only go this way as a fallback if JDBC performance isn't sufficient.

I actually never really considered a CSV file being dynamic; I was thinking more along the lines of loading CSV files with different schemas into the same table (via dynamic columns). If it's at all an option, I would suggest splitting out records by schema first in a pre-processing stage, and then loading the collection of files that match a single schema together. CSV is a fine format for really simple schemas, but I don't think it would be at all suited to storing records with different schemas.

- Gabriel

On Fri, Oct 17, 2014 at 8:27 AM, Bob Dole <[email protected]> wrote:
> Gabriel,
>
> Thanks for your response. My current plan is to implement the bulk load
> using scalding via jdbc. I have not played with Pig, but, my guess is my
> scalding solution will achieve comparable performance.
>
> I haven't done a performance test yet, but, if it turns out that loading via
> jdbc is too slow, I would need to generate the HFiles.
>
> I would be interested in your thoughts on how you'd approach generating
> hfiles. Would you extend the csv bulk loader? How would you represent
> dynamic columns in a csv? A general solution is also further complicated by
> the fact that a dynamic column may have heterogeneous types.
>
> -Bob
>
> On Thursday, October 16, 2014 12:24 AM, Gabriel Reid
> <[email protected]> wrote:
>
>
> Hi Bob,
>
> No, there currently isn't any support for bulk loading dynamic columns.
>
> I think that this would (in theory) be as simple as supplying a custom
> upsert statement to the bulk loader or PhoenixHBaseStorage (if you're
> using Pig), so it probably wouldn't be too tricky to implement.
>
> If you're interested in having something like this in Phoenix, could
> you log a ticket for it at
> https://issues.apache.org/jira/browse/PHOENIX? If you're interested in
> taking a crack at implementing it as well, feel free (as well as
> feeling free to ask for advice on how to go about it).
>
> - Gabriel
>
>
> On Thu, Oct 16, 2014 at 7:58 AM, Bob Dole <[email protected]> wrote:
>> Is there any existing support to perform bulk loading with dynamic columns?
>>
>> Thanks!
