Self contained Spark application with local master without spark-submit

2022-01-19 Thread Colin Williams
Hello,

I noticed I can run Spark applications with a local master via sbt run
and also from the IDE. I'd like to run a single-threaded worker
application as a self-contained jar.

What does sbt run employ that allows it to run a local master?

Can I build an uber jar and run it without spark-submit?
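
For context, a minimal sketch of what I mean (my assumptions: the master is set in code, and the Spark dependencies are bundled in the assembly rather than marked "provided", so the jar can be launched with plain java; names are illustrative):

import org.apache.spark.sql.SparkSession

// Hypothetical self-contained driver: with the master set in code, no
// spark-submit is needed; `java -jar my-app-assembly.jar` (or java -cp)
// can launch it, as long as the Spark libraries are inside the uber jar.
object LocalApp {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("local-app")
      .master("local[1]")   // single-threaded local master
      .getOrCreate()

    import spark.implicits._
    Seq(1, 2, 3).toDF("n").show()

    spark.stop()
  }
}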




Re: Corrupt record handling in spark structured streaming and from_json function

2018-12-31 Thread Colin Williams
Dear Spark user community,

I have received some insight regarding filtering separate dataframes
in my Spark Structured Streaming job. However, I wish to write each of
the dataframes mentioned in the Stack Overflow question above to a
separate location using a parquet writer. My initial impression is
that this requires multiple sinks, but I'm being pressured against
that. I think it might also be possible using the foreach / foreachBatch
writers, but I'm not sure about the parquet writer, or about the
caveats of this approach. Can some more advanced users or developers
suggest how to go about this, particularly without using multiple
streams?
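
To make the foreachBatch idea concrete, here is a minimal sketch (assuming Spark 2.4+ where foreachBatch is available; "parsed" stands for the streaming dataframe that already carries both the parsed struct and the raw value, and the column names event / raw_value and the output paths are illustrative):

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions._

// One stream, two parquet locations: split each micro-batch and write
// the matching and non-matching rows to separate paths.
val writeBatch: (DataFrame, Long) => Unit = (batch, batchId) => {
  batch.persist()
  batch.filter(col("event").isNotNull)
    .select("event.*")
    .write.mode("append").parquet("s3://some-bucket/valid/")
  batch.filter(col("event").isNull)
    .select("raw_value")
    .write.mode("append").parquet("s3://some-bucket/corrupt/")
  batch.unpersist()
  ()
}

val query = parsed.writeStream
  .foreachBatch(writeBatch)
  .option("checkpointLocation", "s3://some-bucket/checkpoints/events/")
  .start()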


On Wed, Dec 26, 2018 at 6:01 PM Colin Williams
 wrote:
>
> https://stackoverflow.com/questions/53938967/writing-corrupt-data-from-kafka-json-datasource-in-spark-structured-streaming
>
> On Wed, Dec 26, 2018 at 2:42 PM Colin Williams
>  wrote:
> >
> > From my initial impression it looks like I'd need to create my own
> > `from_json` using `jsonToStructs` as a reference, but try to handle
> > `case _: BadRecordException => null` or similar, in order to write the
> > non-matching string to a corrupt records column.
> >
> > On Wed, Dec 26, 2018 at 1:55 PM Colin Williams
> >  wrote:
> > >
> > > Hi,
> > >
> > > I'm trying to figure out how I can write records that don't match a
> > > JSON read schema via Spark Structured Streaming to an output sink /
> > > parquet location. Previously I did this in batch using the corrupt
> > > record column features of the batch reader. But in this Spark
> > > Structured Streaming job I'm reading a string from Kafka and using
> > > from_json on the value of that string. If it doesn't match my schema
> > > then from_json returns null for all the rows, and does not populate a
> > > corrupt record column. But I want to somehow obtain the source Kafka
> > > string in a dataframe, and write it to an output sink / parquet
> > > location.
> > >
> > > def getKafkaEventDataFrame(rawKafkaDataFrame: DataFrame, schema: StructType) = {
> > >   val jsonDataFrame = rawKafkaDataFrame.select(col("value").cast("string"))
> > >   jsonDataFrame.select(from_json(col("value"), schema)).select("jsontostructs(value).*")
> > > }




Re: Corrupt record handling in spark structured streaming and from_json function

2018-12-26 Thread Colin Williams
https://stackoverflow.com/questions/53938967/writing-corrupt-data-from-kafka-json-datasource-in-spark-structured-streaming

On Wed, Dec 26, 2018 at 2:42 PM Colin Williams
 wrote:
>
> From my initial impression it looks like I'd need to create my own
> `from_json` using `jsonToStructs` as a reference, but try to handle
> `case _: BadRecordException => null` or similar, in order to write the
> non-matching string to a corrupt records column.
>
> On Wed, Dec 26, 2018 at 1:55 PM Colin Williams
>  wrote:
> >
> > Hi,
> >
> > I'm trying to figure out how I can write records that don't match a
> > JSON read schema via Spark Structured Streaming to an output sink /
> > parquet location. Previously I did this in batch using the corrupt
> > record column features of the batch reader. But in this Spark
> > Structured Streaming job I'm reading a string from Kafka and using
> > from_json on the value of that string. If it doesn't match my schema
> > then from_json returns null for all the rows, and does not populate a
> > corrupt record column. But I want to somehow obtain the source Kafka
> > string in a dataframe, and write it to an output sink / parquet
> > location.
> >
> > def getKafkaEventDataFrame(rawKafkaDataFrame: DataFrame, schema: StructType) = {
> >   val jsonDataFrame = rawKafkaDataFrame.select(col("value").cast("string"))
> >   jsonDataFrame.select(from_json(col("value"), schema)).select("jsontostructs(value).*")
> > }




Re: Corrupt record handling in spark structured streaming and from_json function

2018-12-26 Thread Colin Williams
From my initial impression it looks like I'd need to create my own
`from_json` using `jsonToStructs` as a reference, but try to handle
`case _: BadRecordException => null` or similar, in order to write the
non-matching string to a corrupt records column.

On Wed, Dec 26, 2018 at 1:55 PM Colin Williams
 wrote:
>
> Hi,
>
> I'm trying to figure out how I can write records that don't match a
> JSON read schema via Spark Structured Streaming to an output sink /
> parquet location. Previously I did this in batch using the corrupt
> record column features of the batch reader. But in this Spark
> Structured Streaming job I'm reading a string from Kafka and using
> from_json on the value of that string. If it doesn't match my schema
> then from_json returns null for all the rows, and does not populate a
> corrupt record column. But I want to somehow obtain the source Kafka
> string in a dataframe, and write it to an output sink / parquet
> location.
>
> def getKafkaEventDataFrame(rawKafkaDataFrame: DataFrame, schema: StructType) = {
>   val jsonDataFrame = rawKafkaDataFrame.select(col("value").cast("string"))
>   jsonDataFrame.select(from_json(col("value"), schema)).select("jsontostructs(value).*")
> }




Corrupt record handling in spark structured streaming and from_json function

2018-12-26 Thread Colin Williams
Hi,

I'm trying to figure out how I can write records that don't match a
JSON read schema via Spark Structured Streaming to an output sink /
parquet location. Previously I did this in batch using the corrupt
record column features of the batch reader. But in this Spark
Structured Streaming job I'm reading a string from Kafka and using
from_json on the value of that string. If it doesn't match my schema
then from_json returns null for all the rows, and does not populate a
corrupt record column. But I want to somehow obtain the source Kafka
string in a dataframe, and write it to an output sink / parquet
location.

def getKafkaEventDataFrame(rawKafkaDataFrame: DataFrame, schema: StructType) = {
  val jsonDataFrame = rawKafkaDataFrame.select(col("value").cast("string"))
  jsonDataFrame.select(from_json(col("value"), schema)).select("jsontostructs(value).*")
}
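
One workaround I'm considering is to keep the raw Kafka string next to the parsed struct and route the rows where the parse failed, rather than relying on a corrupt record column. A sketch, under the assumption that a failed parse shows up as a null struct; the column names raw_value and event are illustrative:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.StructType

// Keep the raw string alongside the parsed struct so non-matching rows
// can be written out as-is instead of being silently dropped.
def getKafkaEventDataFrame(rawKafkaDataFrame: DataFrame, schema: StructType): DataFrame = {
  rawKafkaDataFrame
    .select(col("value").cast("string").as("raw_value"))
    .withColumn("event", from_json(col("raw_value"), schema))
}

// Downstream, split on whether the parse succeeded:
//   val good    = df.filter(col("event").isNotNull).select("event.*")
//   val corrupt = df.filter(col("event").isNull).select("raw_value")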




Re: Packaging kafka certificates in uber jar

2018-12-26 Thread Colin Williams
Hi, thanks. This is part of the solution I found after writing the
question. The other part is that I needed to write the input stream to
a temporary file. I would prefer not to write any temporary file, but
the ssl.keystore.location property seems to expect a file path.
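
For reference, a minimal sketch of that workaround (assuming the JKS files sit under src/main/resources and end up in the uber jar; the resource names are illustrative, and the kafka.* option keys are the usual passthrough form for the Kafka source):

import java.io.File
import java.nio.file.{Files, StandardCopyOption}

// Copy a keystore bundled in the jar out to a temp file, since
// ssl.keystore.location expects a filesystem path, not a classpath URL.
def resourceToTempFile(resource: String): String = {
  val in = getClass.getResourceAsStream(resource)   // e.g. "/kafka.client.keystore.jks"
  val tmp = File.createTempFile("keystore", ".jks")
  tmp.deleteOnExit()
  try Files.copy(in, tmp.toPath, StandardCopyOption.REPLACE_EXISTING)
  finally in.close()
  tmp.getAbsolutePath
}

// Then point the Kafka source at the extracted files:
//   .option("kafka.ssl.truststore.location", resourceToTempFile("/kafka.client.truststore.jks"))
//   .option("kafka.ssl.keystore.location", resourceToTempFile("/kafka.client.keystore.jks"))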

On Tue, Dec 25, 2018 at 5:26 AM Anastasios Zouzias  wrote:
>
> Hi Colin,
>
> You can place your certificates under src/main/resources and include them in 
> the uber JAR, see e.g. : 
> https://stackoverflow.com/questions/40252652/access-files-in-resources-directory-in-jar-from-apache-spark-streaming-context
>
> Best,
> Anastasios
>
> On Mon, Dec 24, 2018 at 10:29 PM Colin Williams 
>  wrote:
>>
>> I've been trying to read from Kafka via a Spark streaming client. I
>> found out the Spark cluster doesn't have the certificates deployed. Then
>> I tried using the same local certificates I've been testing with, by
>> packing them in an uber jar and getting a File handle from the
>> classloader resource. But I'm getting a FileNotFoundException. These
>> are JKS certificates. Is anybody aware of how to package certificates
>> in a jar for use with a Kafka client, preferably the Spark one?
>>
>
>
> --
> -- Anastasios Zouzias




Packaging kafka certificates in uber jar

2018-12-24 Thread Colin Williams
I've been trying to read from Kafka via a Spark streaming client. I
found out the Spark cluster doesn't have the certificates deployed. Then
I tried using the same local certificates I've been testing with, by
packing them in an uber jar and getting a File handle from the
classloader resource. But I'm getting a FileNotFoundException. These
are JKS certificates. Is anybody aware of how to package certificates
in a jar for use with a Kafka client, preferably the Spark one?




Re: Casting nested columns and updated nested struct fields.

2018-11-23 Thread Colin Williams
Looks like it's been reported already. It's too bad it's been a year,
but it should be resolved in Spark 3:
https://issues.apache.org/jira/browse/SPARK-22231
On Fri, Nov 23, 2018 at 8:42 AM Colin Williams
 wrote:
>
> Seems like it's worth filing a bug against withColumn.
>
> On Wed, Nov 21, 2018, 6:25 PM Colin Williams wrote:
> >
>> Hello,
>>
>> I'm currently trying to update the schema for a dataframe with nested
>> columns. I would either like to update the schema itself or cast the
>> column without having to explicitly select all the columns just to
>> cast one.
>>
>> Regarding updating the schema, it looks like I would probably need
>> to write a more complex map over the schema to find the StructFields I
>> want to update and update them. I haven't found any examples of this,
>> but it seems like there should be a simpler way to do it.
>>
>> Regarding changing the column on the dataframe itself, using e.g.
>>
>> val newDF = df.withColumn("existing.top.level.FIELD_NAME", df.col("existing.top.level.FIELD_NAME").cast(LongType))
>>
>> I end up with a new column named "existing.top.level.FIELD_NAME" at
>> the root level instead of the nested column being updated to the new
>> type. Has anybody worked out how to update both the nested column's
>> data type on the dataframe and the corresponding field type in the
>> nested schema StructType? Are there any easy ways to do this, or is
>> there a reason it is not trivial?




Re: Casting nested columns and updated nested struct fields.

2018-11-23 Thread Colin Williams
Seems like it's worth filing a bug against withColumn.

On Wed, Nov 21, 2018, 6:25 PM Colin Williams <colin.williams.seat...@gmail.com> wrote:

> Hello,
>
> I'm currently trying to update the schema for a dataframe with nested
> columns. I would either like to update the schema itself or cast the
> column without having to explicitly select all the columns just to
> cast one.
>
> Regarding updating the schema, it looks like I would probably need
> to write a more complex map over the schema to find the StructFields I
> want to update and update them. I haven't found any examples of this,
> but it seems like there should be a simpler way to do it.
>
> Regarding changing the column on the dataframe itself, using e.g.
>
> val newDF = df.withColumn("existing.top.level.FIELD_NAME", df.col("existing.top.level.FIELD_NAME").cast(LongType))
>
> I end up with a new column named "existing.top.level.FIELD_NAME" at
> the root level instead of the nested column being updated to the new
> type. Has anybody worked out how to update both the nested column's
> data type on the dataframe and the corresponding field type in the
> nested schema StructType? Are there any easy ways to do this, or is
> there a reason it is not trivial?
>


Casting nested columns and updated nested struct fields.

2018-11-21 Thread Colin Williams
Hello,

I'm currently trying to update the schema for a dataframe with nested
columns. I would either like to update the schema itself or cast the
column without having to explicitly select all the columns just to
cast one.

Regarding updating the schema, it looks like I would probably need
to write a more complex map over the schema to find the StructFields I
want to update and update them. I haven't found any examples of this,
but it seems like there should be a simpler way to do it.

Regarding changing the column on the dataframe itself, using e.g.

val newDF = df.withColumn("existing.top.level.FIELD_NAME", df.col("existing.top.level.FIELD_NAME").cast(LongType))

I end up with a new column named "existing.top.level.FIELD_NAME" at
the root level instead of the nested column being updated to the new
type. Has anybody worked out how to update both the nested column's
data type on the dataframe and the corresponding field type in the
nested schema StructType? Are there any easy ways to do this, or is
there a reason it is not trivial?
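
One workaround that avoids selecting every top-level column (a sketch, assuming the dotted layout existing.top.level.FIELD_NAME from above; sibling fields still have to be re-listed inside the rebuilt struct, so it is not entirely free of boilerplate):

import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.LongType

// Rebuild the "existing" struct under its own name, casting just one leaf.
val newDF = df.withColumn(
  "existing",
  struct(
    struct(
      struct(
        col("existing.top.level.FIELD_NAME").cast(LongType).as("FIELD_NAME")
        // ... repeat col("existing.top.level.<other>") for the remaining leaves
      ).as("level")
      // ... repeat for the other fields under existing.top
    ).as("top")
    // ... repeat for the other fields directly under existing
  )
)

Newer Spark releases (3.1+, if available to you) also add Column.withField, which updates a single nested field without re-listing its siblings.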




inferred schemas for spark streaming from a Kafka source

2018-11-13 Thread Colin Williams
Does anybody know how to use inferred schemas with structured
streaming: 
https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#schema-inference-and-partition-of-streaming-dataframesdatasets

I have some code like this:

object StreamingApp {

  def launch(config: Config, spark: SparkSession): Unit = {
    import spark.implicits._

    val schemaJson = spark.sparkContext.parallelize(List(config.schema))
    val schemaDF = spark.read.json(schemaJson)
    schemaDF.printSchema()

    // read text from kafka
    val df = spark
      .readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", config.broker)
      .option("subscribe", config.topic)
      .option("startingOffsets", "earliest")
      .load()

    spark.sql("set spark.sql.streaming.schemaInference=true")

    val jsonOptions = Map[String, String]("mode" -> "FAILFAST")

    val org_store_event_df = df.select(
        col("key").cast("string"),
        from_json(col("value").cast("string"), schemaDF.schema, jsonOptions))
      .writeStream
      .format("console")
      .start()
      .awaitTermination()
  }
}

I'd like to compare an inferred schema against the one I provide, to
determine what is missing from my schema, or why I end up with all
nulls in my value column.

Currently I'm building the schema by reading a JSON file. But I'd like
to infer the schema from the stream as suggested by the docs, and I'm
not sure how to replace from_json so that the value column is read
using an inferred schema.

Maybe it's not supported for Kafka streams and only for file streams?
If that is the case, then why do they have different implementations?

Also, shouldn't we make the documentation clearer?
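
In the meantime, one pattern that might work (a sketch, assuming a bounded batch sample of the topic is representative; as far as I can tell, spark.sql.streaming.schemaInference only applies to file-based sources) is to infer the schema once from a batch read of the same topic and feed it into from_json:

import spark.implicits._
import org.apache.spark.sql.functions._

// Infer a schema from a bounded batch read of the same Kafka topic.
val sampleJson = spark.read
  .format("kafka")
  .option("kafka.bootstrap.servers", config.broker)
  .option("subscribe", config.topic)
  .option("startingOffsets", "earliest")
  .option("endingOffsets", "latest")
  .load()
  .select(col("value").cast("string"))
  .as[String]

val inferredSchema = spark.read.json(sampleJson).schema

// Compare against the provided schema, then reuse it in the stream:
//   from_json(col("value").cast("string"), inferredSchema, jsonOptions)
println(inferredSchema.treeString)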




Dataframe reader does not read microseconds, but TimestampType supports microseconds

2018-07-02 Thread Colin Williams
I'm confused as to why Spark's DataFrame reader does not read JSON (or
similar) timestamps with microsecond precision, but instead truncates
them to milliseconds.

This seems strange, given that TimestampType supports microseconds.

For example, create a schema for a JSON object with a column of
TimestampType, then read data into that column with microsecond
timestamps like

2018-05-13 20:25:34.153712

2018-05-13T20:25:37.348006

You will end up with timestamps with millisecond precision.

E.G. 2018-05-13 20:25:34.153



The documentation for TimestampType says: "The data type representing
java.sql.Timestamp values. Please use the singleton
DataTypes.TimestampType."

java.sql.Timestamp provides a method that reads such timestamps, like
Timestamp.valueOf("2018-05-13 20:25:37.348006"), including the
microseconds.

So why does Spark's DataFrame reader drop the ball on this?
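
A small repro of what I'm describing (a sketch; the truncated output is what I observe with the JSON reader, while java.sql.Timestamp itself keeps the sub-millisecond digits):

import java.sql.Timestamp
import org.apache.spark.sql.types._
import spark.implicits._

val schema = StructType(Seq(StructField("ts", TimestampType)))

// One JSON record with a microsecond timestamp.
val ds = Seq("""{"ts": "2018-05-13 20:25:34.153712"}""").toDS()

spark.read.schema(schema).json(ds).show(false)
// shows 2018-05-13 20:25:34.153 -- millisecond precision only

// java.sql.Timestamp preserves the fractional seconds as written:
println(Timestamp.valueOf("2018-05-13 20:25:34.153712"))
// 2018-05-13 20:25:34.153712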


Specifying a custom Partitioner on RDD creation in Spark 2

2018-04-10 Thread Colin Williams
Hi,

I'm currently creating RDDs using a pattern like the following:

val rdd: RDD[String] = session.sparkContext.parallelize(longkeys).flatMap(
  key => {
    logInfo(s"job at key: ${key}")
    Source.fromBytes(S3Util.getBytes(S3Util.getClient(region,
      S3Util.getCredentialsProvider("INSTANCE", "")), bucket, key))
      .getLines()
  }
)

We've been using this pattern, or something similar, to work around
issues regarding S3 and our Hadoop version. However, the same pattern
could also be applied to other types of data sources which may not
have a connector.

This method has been working out fairly well, but I'd like more
control over how the data is partitioned from the start.



I tried to manually partition the data without a partitioner, but got
JVM exceptions about my arrays being too large for the JVM.

val keyList = groupedKeys.keys.toList
val rdd: RDD[String] =
  session.sparkContext.parallelize(keyList, keyList.length).flatMap {
    key =>
      logInfo(s"job at day: ${key}")
      val byteArrayBuffer = new ArrayBuffer[Byte]()
      val objectKeyList: List[String] = groupedKeys(key)
      objectKeyList.foreach(
        objectKey => {
          logInfo(s"working on object: ${objectKey}")
          byteArrayBuffer.appendAll(S3Util.getBytes(S3Util.getClient(region,
            S3Util.getCredentialsProvider("INSTANCE", "")), bucket, objectKey))
        }
      )
      Source.fromBytes(byteArrayBuffer.toArray[Byte]).getLines()
  }



Then I've defined a custom partitioner based on my source data:

class dayPartitioner(keys: List[String]) extends Partitioner with Logger {

  val keyMap: Map[String, List[String]] = keys.groupBy(_.substring(0, 10))
  val partitions = keyMap.keySet.size
  val partitionMap: Map[String, Int] = keyMap.keys.zipWithIndex.toMap

  override def getPartition(key: Any): Int = {
    val keyString = key.asInstanceOf[String]
    val partitionKey = keyString.substring(0, 10)
    partitionMap(partitionKey)
  }

  override def numPartitions: Int = partitions
}


I'd like to know: do I have to create a custom RDD class to get this
kind of control over partitioning, and then use it in the pattern
above?

If so, I'd also like a reference on doing this, to hopefully save me
some headaches and gotchas from a naive approach. I've found one such
example, https://stackoverflow.com/a/25204589, but it's from an older
version of Spark.

I'm hoping there is something more recent and more in-depth. I don't
mind references to books or otherwise.
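
For reference, here is the kind of wiring I had in mind without a custom RDD class (a sketch only: it assumes the keys can be turned into a pair RDD first, and it reuses dayPartitioner, S3Util, groupedKeys, region and bucket from the snippets above):

// Pair each object key with its day prefix, apply the custom partitioner,
// then fetch the data inside each partition.
val objectKeys: List[String] = groupedKeys.values.flatten.toList

val rdd: RDD[String] = session.sparkContext
  .parallelize(objectKeys.map(k => (k.substring(0, 10), k)))
  .partitionBy(new dayPartitioner(objectKeys))
  .mapPartitions { iter =>
    iter.flatMap { case (_, objectKey) =>
      Source.fromBytes(S3Util.getBytes(S3Util.getClient(region,
        S3Util.getCredentialsProvider("INSTANCE", "")), bucket, objectKey))
        .getLines()
    }
  }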


Best,

Colin Williams




Re: spark-sql importing schemas from catalogString or schema.toString()

2018-03-28 Thread Colin Williams
val test_schema = DataType.fromJson(schema).asInstanceOf[StructType]
val session = SparkHelper.getSparkSession
val df1: DataFrame = session.read
  .format("json")
  .schema(test_schema)
  .option("inferSchema","false")
  .option("mode","FAILFAST")
  .load("src/test/resources/*.gz")
df1.show(80)

On Wed, Mar 28, 2018 at 5:10 PM, Colin Williams
<colin.williams.seat...@gmail.com> wrote:
> I've had more success exporting the schema toJson and importing that.
> Something like:
>
>
> val df1: DataFrame = session.read
>   .format("json")
>   .schema(test_schema)
>   .option("inferSchema","false")
>   .option("mode","FAILFAST")
>   .load("src/test/resources/*.gz")
> df1.show(80)
>
>
>
> On Wed, Mar 28, 2018 at 3:25 PM, Colin Williams
> <colin.williams.seat...@gmail.com> wrote:
>> The toString representation looks like the following, where "someName" is unique:
>>
>>  StructType(StructField("someName",StringType,true),
>> StructField("someName",StructType(StructField("someName",StructType(StructField("someName",StringType,true),
>> StructField("someName",StringType,true)),true),
>> StructField("someName",StructType(StructField("someName",StringType,true),
>> StructField("someName",StringType,true)),true),
>> StructField("someName",StructType(StructField("someName",StringType,true)),true),
>>  StructField("someName",StructType(StructField("someName",StringType,true),
>> StructField("someName",StringType,true)),true),
>> StructField("someName",StructType(StructField("someName",StringType,true),
>>  StructField("someName",StringType,true)),true),
>> StructField("someName",StructType(StructField("someName",StringType,true),
>> StructField("someName",StringType,true)),true),
>>  StructField("someName",StructType(StructField("someName",StringType,true),
>> StructField("someName",StringType,true)),true),
>> StructField("someName",StructType(StructField("someName",StringType,true),
>>  StructField("someName",StringType,true)),true),
>> StructField("someName",StructType(StructField("someName",StringType,true),
>> StructField("someName",StringType,true)),true),
>>  StructField("someName",StructType(StructField("someName",StringType,true),
>> StructField("someName",StringType,true)),true),
>> StructField("someName",StructType(StructField("someName",StringType,true),
>>  StructField("someName",StringType,true)),true),
>> StructField("someName",StructType(StructField("someName",StringType,true),
>> StructField("someName",StringType,true)),true),
>> StructField("someName",
>> StructType(StructField("someName",StringType,true),
>>  StructField("someName",StringType,true)),true),
>> StructField("someName",StructType(StructField("someName",StringType,true),
>> StructField("someName",StringType,true)),true),
>> StructField("someName",
>> StructType(StructField("someName",StringType,true),
>> StructField("someName",StringType,true)),true),
>> StructField("someName",StructType(StructField("someName",StringType,true),
>> StructField("someName",StringType,  true)),true),
>>  StructField("someName",StructType(StructField("someName",StringType,true),
>> StructField("someName",StringType,true)),true),
>> StructField("someName",StructType(StructField("someName",StringType,true),
>>  StructField("someName",StringType,true)),true),
>> StructField("someName",StructType(StructField("someName",StringType,true),
>> StructField("someName",StringType,true)),true),
>> StructField("someName",
>> StructType(StructField("someName",StringType,true),
>>  StructField("someName",StringType,true)),true),
>> StructField("someName",StructType(StructField("someName",StringType,true),
>> StructField("someName",StringType,true)),true),
>> StructField("someName",
>> StructType(StructField("someName",StringType,true),
>> StructField("someName",StringType,true)),true),
>> Struct

Re: spark-sql importing schemas from catalogString or schema.toString()

2018-03-28 Thread Colin Williams
I've had more success exporting the schema toJson and importing that.
Something like:


val df1: DataFrame = session.read
  .format("json")
  .schema(test_schema)
  .option("inferSchema","false")
  .option("mode","FAILFAST")
  .load("src/test/resources/*.gz")
df1.show(80)



On Wed, Mar 28, 2018 at 3:25 PM, Colin Williams
<colin.williams.seat...@gmail.com> wrote:
> The toString representation looks like the following, where "someName" is unique:
>
>  StructType(StructField("someName",StringType,true),
> StructField("someName",StructType(StructField("someName",StructType(StructField("someName",StringType,true),
> StructField("someName",StringType,true)),true),
> StructField("someName",StructType(StructField("someName",StringType,true),
> StructField("someName",StringType,true)),true),
> StructField("someName",StructType(StructField("someName",StringType,true)),true),
>  StructField("someName",StructType(StructField("someName",StringType,true),
> StructField("someName",StringType,true)),true),
> StructField("someName",StructType(StructField("someName",StringType,true),
>  StructField("someName",StringType,true)),true),
> StructField("someName",StructType(StructField("someName",StringType,true),
> StructField("someName",StringType,true)),true),
>  StructField("someName",StructType(StructField("someName",StringType,true),
> StructField("someName",StringType,true)),true),
> StructField("someName",StructType(StructField("someName",StringType,true),
>  StructField("someName",StringType,true)),true),
> StructField("someName",StructType(StructField("someName",StringType,true),
> StructField("someName",StringType,true)),true),
>  StructField("someName",StructType(StructField("someName",StringType,true),
> StructField("someName",StringType,true)),true),
> StructField("someName",StructType(StructField("someName",StringType,true),
>  StructField("someName",StringType,true)),true),
> StructField("someName",StructType(StructField("someName",StringType,true),
> StructField("someName",StringType,true)),true),
> StructField("someName",
> StructType(StructField("someName",StringType,true),
>  StructField("someName",StringType,true)),true),
> StructField("someName",StructType(StructField("someName",StringType,true),
> StructField("someName",StringType,true)),true),
> StructField("someName",
> StructType(StructField("someName",StringType,true),
> StructField("someName",StringType,true)),true),
> StructField("someName",StructType(StructField("someName",StringType,true),
> StructField("someName",StringType,  true)),true),
>  StructField("someName",StructType(StructField("someName",StringType,true),
> StructField("someName",StringType,true)),true),
> StructField("someName",StructType(StructField("someName",StringType,true),
>  StructField("someName",StringType,true)),true),
> StructField("someName",StructType(StructField("someName",StringType,true),
> StructField("someName",StringType,true)),true),
> StructField("someName",
> StructType(StructField("someName",StringType,true),
>  StructField("someName",StringType,true)),true),
> StructField("someName",StructType(StructField("someName",StringType,true),
> StructField("someName",StringType,true)),true),
> StructField("someName",
> StructType(StructField("someName",StringType,true),
> StructField("someName",StringType,true)),true),
> StructField("someName",StructType(StructField("someName",StringType,true),
> StructField("someName",StringType,  true)),true)),true),
>  StructField("someName",BooleanType,true),
> StructField("someName",LongType,true),
> StructField("someName",StringType,true),
> StructField("someName",StringType,true),
> StructField("someName",StringType,true),
> StructField("someName",StringType,true))
>
>
> The catalogString looks something like the following, where SOME_TABLE_NAME is unique:
>
> struct<action:string,SOME_TABLE_NAME:struct<SOME_TABLE_NAME:struct<newValue:string,SOME_TABLE_NAME:string>,
> 

Re: spark-sql importing schemas from catalogString or schema.toString()

2018-03-28 Thread Colin Williams
The toString representation looks like the following, where "someName" is unique:

 StructType(StructField("someName",StringType,true),
StructField("someName",StructType(StructField("someName",StructType(StructField("someName",StringType,true),
StructField("someName",StringType,true)),true),
StructField("someName",StructType(StructField("someName",StringType,true),
StructField("someName",StringType,true)),true),
StructField("someName",StructType(StructField("someName",StringType,true)),true),
 StructField("someName",StructType(StructField("someName",StringType,true),
StructField("someName",StringType,true)),true),
StructField("someName",StructType(StructField("someName",StringType,true),
 StructField("someName",StringType,true)),true),
StructField("someName",StructType(StructField("someName",StringType,true),
StructField("someName",StringType,true)),true),
 StructField("someName",StructType(StructField("someName",StringType,true),
StructField("someName",StringType,true)),true),
StructField("someName",StructType(StructField("someName",StringType,true),
 StructField("someName",StringType,true)),true),
StructField("someName",StructType(StructField("someName",StringType,true),
StructField("someName",StringType,true)),true),
 StructField("someName",StructType(StructField("someName",StringType,true),
StructField("someName",StringType,true)),true),
StructField("someName",StructType(StructField("someName",StringType,true),
 StructField("someName",StringType,true)),true),
StructField("someName",StructType(StructField("someName",StringType,true),
StructField("someName",StringType,true)),true),
StructField("someName",
StructType(StructField("someName",StringType,true),
 StructField("someName",StringType,true)),true),
StructField("someName",StructType(StructField("someName",StringType,true),
StructField("someName",StringType,true)),true),
StructField("someName",
StructType(StructField("someName",StringType,true),
StructField("someName",StringType,true)),true),
StructField("someName",StructType(StructField("someName",StringType,true),
StructField("someName",StringType,  true)),true),
 StructField("someName",StructType(StructField("someName",StringType,true),
StructField("someName",StringType,true)),true),
StructField("someName",StructType(StructField("someName",StringType,true),
 StructField("someName",StringType,true)),true),
StructField("someName",StructType(StructField("someName",StringType,true),
StructField("someName",StringType,true)),true),
StructField("someName",
StructType(StructField("someName",StringType,true),
 StructField("someName",StringType,true)),true),
StructField("someName",StructType(StructField("someName",StringType,true),
StructField("someName",StringType,true)),true),
StructField("someName",
StructType(StructField("someName",StringType,true),
StructField("someName",StringType,true)),true),
StructField("someName",StructType(StructField("someName",StringType,true),
StructField("someName",StringType,  true)),true)),true),
 StructField("someName",BooleanType,true),
StructField("someName",LongType,true),
StructField("someName",StringType,true),
StructField("someName",StringType,true),
StructField("someName",StringType,true),
StructField("someName",StringType,true))


The catalogString looks something like the following, where SOME_TABLE_NAME is unique:

struct<action:string,SOME_TABLE_NAME:struct<SOME_TABLE_NAME:struct<newValue:string,SOME_TABLE_NAME:string>,

SOME_TABLE_NAME:struct<newValue:string,SOME_TABLE_NAME:string>,SOME_TABLE_NAME:struct,
SOME_TABLE_NAME:struct<newValue:string,SOME_TABLE_NAME:string>,SOME_TABLE_NAME:struct<newValue:string,
 
SOME_TABLE_NAME:string>,SOME_TABLE_NAME:struct<newValue:string,SOME_TABLE_NAME:string>,SOME_TABLE_NAME:
struct<newValue:string,SOME_TABLE_NAME:string>,SOME_TABLE_NAME:struct<newValue:string,SOME_TABLE_NAME:
 
string>,SOME_TABLE_NAME:struct<newValue:string,SOME_TABLE_NAME:string>,SOME_TABLE_NAME:struct,SOME_TABLE_NAME:struct<newValue:string,SOME_TABLE_NAME:string>,
 
SOME_TABLE_NAME:struct<newValue:string,SOME_TABLE_NAME:string>,SOME_TABLE_NAME:struct<newValue:string,
 
SOME_TABLE_NAME:string>,SOME_TAB

spark-sql importing schemas from catalogString or schema.toString()

2018-03-28 Thread Colin Williams
I've been learning Spark SQL and have been trying to export and import
some of the generated schemas in order to edit them. I've been writing
the schemas to strings like df1.schema.toString() and
df.schema.catalogString.

But I've been having trouble loading the schemas I created. Does anyone
know if it's possible to work with the catalogString? I couldn't find
many resources on working with it. Is it possible to create a schema
from this string and load it using the SparkSession?

Similarly, I haven't yet successfully loaded the toString schema, even
after some small edits...


There's a little tidbit about some of this here:
https://jaceklaskowski.gitbooks.io/mastering-spark-sql/spark-sql-DataType.html
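
For completeness, here is a sketch of the JSON round trip that the replies above converge on (schema.json / DataType.fromJson); the StructType.fromDDL note at the end is an assumption about recent Spark versions, for parsing DDL-style strings closer to the catalogString form:

import org.apache.spark.sql.types.{DataType, StructType}

// Export: a JSON representation of the schema that can be edited by hand.
val schemaJson: String = df1.schema.json        // or df1.schema.prettyJson

// Import: parse the (possibly edited) JSON back into a StructType.
val test_schema = DataType.fromJson(schemaJson).asInstanceOf[StructType]

// Recent Spark versions can also parse a DDL-style string, e.g.
//   StructType.fromDDL("a STRING, b STRUCT<c: BIGINT>")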
