Sounds good, Ken. If you could comment on SQOOP-2649 that you'd be interested in using that functionality if it were available, that'd be much appreciated.
Thanks, James On Fri, Jan 8, 2016 at 5:04 PM, Ken Hampson <[email protected]> wrote: > Hi, James: > > Thanks for the reply. I took a look at the patch attached to PHOENIX-2434, > and it does look like it has a good chance to solve the problem, as > Postgres exports boolean values as 't' and 'f', which the patch now checks > for. Once we're able to get to Phoenix 4.7.0, I'll definitely try it out to > confirm. > > We don't and won't be able to use Flyway due to various architectural > considerations, but we do use Sqoop to get data from Postgres into Hive. I > had skimmed SQOOP-2649 when originally looking at how to bulk load into > Phoenix, and figured CSV was the way to go for now since the SQOOP-2649 > enhancement was still in patch form. It's definitely something I will keep > an eye on going forward. > > Thanks again, > - Ken > > > On Fri, Jan 8, 2016 at 1:13 PM James Taylor <[email protected]> > wrote: > >> Hi Ken, >> PHOENIX-2434 improved our CSV handling of booleans and will appear in our >> upcoming 4.7.0 release. It'd be good if you can confirm whether or not this >> is what you need. We definitely want to support ingest of CSVs from other >> RDBMSs. >> >> There are a couple of other avenues of ingest into Phoenix: >> - using Flyway DB (http://flywaydb.org/) where Phoenix support has been >> added (not sure if this has appeared in a release yet or not, though) >> - through a pull request to add support for Phoenix in Scoop ( >> http://sqoop.apache.org/). Unfortunately, this isn't getting much >> attention in that community (if this is useful for you, you may want to >> chime in on SQOOP-2649 or on their mailing list). >> >> Thanks, >> James >> >> On Thu, Jan 7, 2016 at 5:28 PM, Ken Hampson <[email protected]> wrote: >> >>> Hi, all: >>> >>> I am in the process of importing some data from Postgres into Phoenix as >>> part of a larger project. I noticed after doing so and cross-checking the >>> results of a fairly complex JOIN query (of 6 tables of so) that the number >>> of rows returned by Postgres and Phoenix differed -- Phoenix had a >>> significant number more rows. I investigated this further, and ultimately >>> determined that the reason was that all *boolean *values were imported >>> as being false, regardless of their actual value. >>> >>> When I investigated that aspect further, I discovered that Phoenix >>> appears to require the full strings "true" and "false" in the CSV in order >>> to properly register their actual values. As a workaround, I am currently >>> using a full SELECT in the "\copy" statement in psql to select each column >>> individually, and for each boolean column, I am using a CASE statement to >>> emit the full strings. I have since also learned that casting the boolean >>> to the "text" datatype will also cause Postgres to emit the full string. >>> >>> In either case, though, the pain point is that this discrepancy requires >>> that I examine each table closely for boolean columns, and if there are >>> any, I then need to enumerate all the columns out explicitly and use either >>> the CASE statement or the cast method on the boolean columns in order to >>> get the values Phoenix expects. In my case, I didn't need to transform any >>> of the data otherwise, so this forced a much more complex procedure. >>> >>> While psql provides many options to change various values -- delimiters >>> and what not -- there is no option to change the boolean values. Likewise, >>> as near as I can tell, there are no options to change what Phoenix expects, >>> either. >>> >>> I am wondering: >>> >>> - Is there any way from within Phoenix to change the boolean values that >>> it expects in the CSV? >>> - Are there any other methods from a Phoenix perspective that you can >>> think of that would obviate the need for the column-by-column enumeration >>> procedure I outlined above? >>> - Since Postgres is a major open source RDBMS which I suspect many >>> people -- myself included -- will want to export data from into Phoenix, >>> could there be an enhancement to Phoenix to allow an option to specify a >>> default boolean value, or perhaps an option to specify the CSV source, from >>> which the proper value could be defined and utilized automatically? >>> >>> Thanks, >>> - Ken >>> >> >>
