Re: Importing Postgres-generated CSVs into Phoenix

James Taylor Fri, 08 Jan 2016 17:33:49 -0800

Sounds good, Ken. If you could comment on SQOOP-2649 that you'd be
interested in using that functionality if it were available, that'd be much
appreciated.


Thanks,
James

On Fri, Jan 8, 2016 at 5:04 PM, Ken Hampson <[email protected]> wrote:

> Hi, James:
>
> Thanks for the reply. I took a look at the patch attached to PHOENIX-2434,
> and it does look like it has a good chance to solve the problem, as
> Postgres exports boolean values as 't' and 'f', which the patch now checks
> for. Once we're able to get to Phoenix 4.7.0, I'll definitely try it out to
> confirm.
>
> We don't and won't be able to use Flyway due to various architectural
> considerations, but we do use Sqoop to get data from Postgres into Hive. I
> had skimmed  SQOOP-2649 when originally looking at how to bulk load into
> Phoenix, and figured CSV was the way to go for now since the  SQOOP-2649
> enhancement was still in patch form. It's definitely something I will keep
> an eye on going forward.
>
> Thanks again,
> - Ken
>
>
> On Fri, Jan 8, 2016 at 1:13 PM James Taylor <[email protected]>
> wrote:
>
>> Hi Ken,
>> PHOENIX-2434 improved our CSV handling of booleans and will appear in our
>> upcoming 4.7.0 release. It'd be good if you can confirm whether or not this
>> is what you need. We definitely want to support ingest of CSVs from other
>> RDBMSs.
>>
>> There are a couple of other avenues of ingest into Phoenix:
>> - using Flyway DB (http://flywaydb.org/) where Phoenix support has been
>> added (not sure if this has appeared in a release yet or not, though)
>> - through a pull request to add support for Phoenix in Scoop (
>> http://sqoop.apache.org/). Unfortunately, this isn't getting much
>> attention in that community (if this is useful for you, you may want to
>> chime in on SQOOP-2649 or on their mailing list).
>>
>> Thanks,
>> James
>>
>> On Thu, Jan 7, 2016 at 5:28 PM, Ken Hampson <[email protected]> wrote:
>>
>>> Hi, all:
>>>
>>> I am in the process of importing some data from Postgres into Phoenix as
>>> part of a larger project. I noticed after doing so and cross-checking the
>>> results of a fairly complex JOIN query (of 6 tables of so) that the number
>>> of rows returned by Postgres and Phoenix differed -- Phoenix had a
>>> significant number more rows. I investigated this further, and ultimately
>>> determined that the reason was that all *boolean *values were imported
>>> as being false, regardless of their actual value.
>>>
>>> When I investigated that aspect further, I discovered that Phoenix
>>> appears to require the full strings "true" and "false" in the CSV in order
>>> to properly register their actual values. As a workaround, I am currently
>>> using a full SELECT in the "\copy" statement in psql to select each column
>>> individually, and for each boolean column, I am using a CASE statement to
>>> emit the full strings. I have since also learned that casting the boolean
>>> to the "text" datatype will also cause Postgres to emit the full string.
>>>
>>> In either case, though, the pain point is that this discrepancy requires
>>> that I examine each table closely for boolean columns, and if there are
>>> any, I then need to enumerate all the columns out explicitly and use either
>>> the CASE statement or the cast method on the boolean columns in order to
>>> get the values Phoenix expects. In my case, I didn't need to transform any
>>> of the data otherwise, so this forced a much more complex procedure.
>>>
>>> While psql provides many options to change various values -- delimiters
>>> and what not -- there is no option to change the boolean values. Likewise,
>>> as near as I can tell, there are no options to change what Phoenix expects,
>>> either.
>>>
>>> I am wondering:
>>>
>>> - Is there any way from within Phoenix to change the boolean values that
>>> it expects in the CSV?
>>> - Are there any other methods from a Phoenix perspective that you can
>>> think of that would obviate the need for the column-by-column enumeration
>>> procedure I outlined above?
>>> - Since Postgres is a major open source RDBMS which I suspect many
>>> people -- myself included -- will want to export data from into Phoenix,
>>> could there be an enhancement to Phoenix to allow an option to specify a
>>> default boolean value, or perhaps an option to specify the CSV source, from
>>> which the proper value could be defined and utilized automatically?
>>>
>>> Thanks,
>>> - Ken
>>>
>>
>>

Re: Importing Postgres-generated CSVs into Phoenix

Reply via email to