Hi Paul,

Unfortunately, out of the box the Spark integration doesn't support saving to dynamic columns. It's worth filing a JIRA enhancement request for this, and if you're interested in contributing a patch, here are the spots I think would need enhancing:
The saving code derives the column names to use with Phoenix from the DataFrame itself here [1], as `fieldArray`. We would likely need a new DataFrame parameter here [2] for passing in the full column list, with dynamic columns included.

[1] https://github.com/apache/phoenix/blob/master/phoenix-spark/src/main/scala/org/apache/phoenix/spark/DataFrameFunctions.scala#L32-L35
[2] https://github.com/apache/phoenix/blob/master/phoenix-spark/src/main/scala/org/apache/phoenix/spark/DefaultSource.scala#L38

The output configuration, which takes care of getting the MapReduce bits ready for saving, would also need to be updated to support the dynamic column definitions here [3], and the 'UPSERT' statement construction would then need to be adjusted to support them as well here [4]. (A rough sketch of what [4] would need to produce, and a possible workaround in the meantime, are at the bottom of this mail.)

[3] https://github.com/apache/phoenix/blob/master/phoenix-spark/src/main/scala/org/apache/phoenix/spark/ConfigurationUtil.scala#L25-L38
[4] https://github.com/apache/phoenix/blob/master/phoenix-core/src/main/java/org/apache/phoenix/mapreduce/util/PhoenixConfigurationUtil.java#L259

Thanks,
Josh

On Mon, Jul 25, 2016 at 5:49 PM, Paul Jones <pajo...@adobe.com> wrote:
> Is it possible to save a dataframe into a table where the columns are
> dynamic?
>
> For instance, I have loaded a CSV file with a header (key, cat1, cat2)
> into a dataframe. All values are strings. I created a table like this:
> create table mytable ("KEY" varchar not null primary key); The code is as
> follows:
>
> val df = sqlContext.read
>   .format("com.databricks.spark.csv")
>   .option("header", "true")
>   .option("inferSchema", "true")
>   .option("delimiter", "\t")
>   .load("saint.tsv")
>
> df.write
>   .format("org.apache.phoenix.spark")
>   .mode("overwrite")
>   .option("table", "mytable")
>   .option("zkUrl", "server:2181/hbase")
>   .save()
>
> The CSV files I process always have a key column, but I don't know what the
> other columns will be until I start processing. The code above fails for my
> example unless I create static columns named cat1 and cat2. Can I change
> the save somehow to run an UPSERT specifying the names/column types, thus
> saving into dynamic columns?
>
> Thanks in advance,
> Paul
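
P.S. To make [4] a bit more concrete, here's a rough, untested sketch of the statement construction the patch would need to end up with. Phoenix's dynamic column syntax carries the type inline after each column name in the UPSERT column list. The `DynamicColumn` case class and `buildUpsert` helper below are hypothetical glue, not the actual PhoenixConfigurationUtil code:

case class DynamicColumn(name: String, sqlType: String)

def buildUpsert(table: String,
                staticCols: Seq[String],
                dynamicCols: Seq[DynamicColumn]): String = {
  // Static columns appear by name alone; dynamic columns carry their type
  // inline, which is what tells Phoenix to treat them as dynamic.
  val colList = staticCols ++ dynamicCols.map(c => s"${c.name} ${c.sqlType}")
  val params = Seq.fill(colList.size)("?")
  s"UPSERT INTO $table (${colList.mkString(", ")}) VALUES (${params.mkString(", ")})"
}

// buildUpsert("MYTABLE", Seq("KEY"),
//   Seq(DynamicColumn("CAT1", "VARCHAR"), DynamicColumn("CAT2", "VARCHAR")))
// returns:
//   UPSERT INTO MYTABLE (KEY, CAT1 VARCHAR, CAT2 VARCHAR) VALUES (?, ?, ?)

The DataFrame side would then just need a way to hand that dynamic column list in, e.g. a new (also hypothetical) .option("dynamicColumns", "CAT1 VARCHAR, CAT2 VARCHAR") that DefaultSource [2] parses and threads through to the output configuration [3].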
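
In the meantime, Paul, one workaround you could try is to skip the phoenix-spark save path entirely and issue the dynamic-column UPSERT yourself over plain Phoenix JDBC from each partition. Untested sketch, assuming your key column is literally named "key", every other column is a VARCHAR as in your example, and "jdbc:phoenix:server:2181:/hbase" is the JDBC form of your zkUrl:

import java.sql.DriverManager

// Build the column list in schema order so positional binding lines up.
// The key column is referenced by name only; the others get a type inline,
// which makes Phoenix treat them as dynamic columns.
val colDefs = df.schema.fieldNames.map {
  case "key" => "\"KEY\""
  case c     => s""""${c.toUpperCase}" VARCHAR"""
}
val sql = s"UPSERT INTO mytable (${colDefs.mkString(", ")}) " +
          s"VALUES (${Seq.fill(colDefs.length)("?").mkString(", ")})"

df.foreachPartition { rows =>
  // One connection and prepared statement per partition.
  val conn = DriverManager.getConnection("jdbc:phoenix:server:2181:/hbase")
  conn.setAutoCommit(false)
  val stmt = conn.prepareStatement(sql)
  try {
    rows.foreach { row =>
      (0 until row.length).foreach(i => stmt.setString(i + 1, row.getString(i)))
      stmt.executeUpdate()
    }
    conn.commit()
  } finally {
    stmt.close()
    conn.close()
  }
}

Note this gives you per-partition commits rather than the connector's single bulk write, so a failure mid-job can leave partially written data.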