Re: bulk loading with dynamic columns

Gabriel Reid Thu, 16 Oct 2014 23:54:47 -0700

Bob,

Going via Scalding sounds like a fine idea as well -- the advantage of
using Pig is that you wouldn't need to implement anything custom in
terms of JDBC handling (because it already exists), but indeed I would
expect that you'll get comparable performance with Scalding.
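
For what it's worth, the custom part of the JDBC route is mostly a matter of generating an UPSERT statement that declares the dynamic columns inline, next to the regular ones. Below is a minimal sketch of that, written in Python purely for illustration. The function name `build_dynamic_upsert` and the table and column names are hypothetical, and no identifier quoting or validation is done, so treat it as a starting point rather than the way Phoenix or the bulk loader does it.

```python
def build_dynamic_upsert(table, static_columns, dynamic_columns):
    """Build a parameterized UPSERT that declares dynamic columns inline.

    table           -- table name (hypothetical; not quoted or validated here)
    static_columns  -- names of columns already defined on the table
    dynamic_columns -- (name, sql_type) pairs for columns not in the table schema
    """
    if not static_columns:
        raise ValueError("at least one static (primary key) column is required")

    # Static columns are listed by name only; dynamic columns carry their type.
    column_specs = list(static_columns)
    column_specs += [f"{name} {sql_type}" for name, sql_type in dynamic_columns]

    # One bind parameter per column, to be filled in through JDBC.
    placeholders = ", ".join("?" for _ in column_specs)
    return f"UPSERT INTO {table} ({', '.join(column_specs)}) VALUES ({placeholders})"


if __name__ == "__main__":
    print(build_dynamic_upsert(
        "EVENT_LOG",
        ["EVENT_ID", "EVENT_TYPE"],
        [("USED_MEMORY", "BIGINT")],
    ))
```

Since the statement depends on which dynamic columns a record has, you would want one prepared statement per distinct set of dynamic columns rather than one per record.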

If you want to generate HFiles, I would indeed look at extending (or
reusing parts of) the CsvBulkLoadTool, which currently creates HFiles.
However, I would definitely only go this way as a fallback if JDBC
performance isn't sufficient.

I actually never really considered a CSV file being dynamic; I was
thinking more along the lines of loading CSV files with different
schemas into the same table (via dynamic columns). If it's at all an
option, I would suggest splitting out records by schema first in a
pre-processing stage, and then loading the collection of files that
match a single schema together. CSV is a fine format for really simple
schemas, but I don't think it would be at all suited to storing
records with different schemas.
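
To make the pre-processing idea a bit more concrete, here is a rough sketch in Python. It assumes the records have already been parsed into dicts keyed by column name (how you get there depends on your input format), and it treats the sorted set of column names as the "schema". The helper names `split_by_schema` and `to_csv` are made up for this example.

```python
import csv
import io
from collections import defaultdict


def split_by_schema(records):
    """Group dict records by their schema (the sorted tuple of field names)."""
    groups = defaultdict(list)
    for record in records:
        schema = tuple(sorted(record))
        groups[schema].append(record)
    return dict(groups)


def to_csv(schema, records):
    """Render records that share one schema as CSV text, columns in schema order."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    for record in records:
        writer.writerow([record[column] for column in schema])
    return buf.getvalue()


if __name__ == "__main__":
    records = [
        {"id": 1, "a": "x"},
        {"id": 2, "b": "y"},
        {"id": 3, "a": "z"},
    ]
    for schema, group in split_by_schema(records).items():
        print(schema)
        print(to_csv(schema, group))
```

Each resulting group can then be written to its own file and loaded with a column list matching that one schema.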

- Gabriel

On Fri, Oct 17, 2014 at 8:27 AM, Bob Dole <[email protected]> wrote:
> Gabriel,
>
> Thanks for your response. My current plan is to implement the bulk load
> using Scalding via JDBC. I have not played with Pig, but my guess is that my
> Scalding solution will achieve comparable performance.
>
> I haven't done a performance test yet, but if it turns out that loading via
> JDBC is too slow, I would need to generate the HFiles.
>
> I would be interested in your thoughts on how you'd approach generating
> HFiles. Would you extend the CSV bulk loader? How would you represent
> dynamic columns in a CSV? A general solution is also further complicated by
> the fact that a dynamic column may have heterogeneous types.
>
> -Bob
>
> On Thursday, October 16, 2014 12:24 AM, Gabriel Reid
> <[email protected]> wrote:
>
>
> Hi Bob,
>
> No, there currently isn't any support for bulk loading dynamic columns.
>
> I think that this would (in theory) be as simple as supplying a custom
> upsert statement to the bulk loader or PhoenixHBaseStorage (if you're
> using Pig), so it probably wouldn't be too tricky to implement.
>
> If you're interested in having something like this in Phoenix, could
> you log a ticket for it at
> https://issues.apache.org/jira/browse/PHOENIX? If you're interested in
> taking a crack at implementing it as well, feel free (as well as
> feeling free to ask for advice on how to go about it).
>
> - Gabriel
>
>
> On Thu, Oct 16, 2014 at 7:58 AM, Bob Dole <[email protected]> wrote:
>> Is there any existing support for performing bulk loading with dynamic columns?
>>
>> Thanks!
>
>
