Re: Querying on Deeply Nested JSON Structures

2017-07-15 Thread Matt Deaver
I would love to be told otherwise, but I believe your options are either to 1) use the explode function or 2) pre-process the data so you don't have to explode it.
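A minimal PySpark sketch of option 1, assuming a hypothetical input dataset with an array-of-structs column named "items"; the path and field names are illustrative:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, explode

spark = SparkSession.builder.appName("nested-json").getOrCreate()

# Hypothetical input: each record has an "id" plus an array column "items"
# whose elements are structs with "sku" and "price" fields.
df = spark.read.json("s3://my-bucket/events/")

# explode() emits one row per array element; nested struct fields can then
# be selected with ordinary dot paths.
flat = (df
        .select(col("id"), explode(col("items")).alias("item"))
        .select("id", "item.sku", "item.price"))

flat.createOrReplaceTempView("flat_items")
spark.sql("SELECT sku, SUM(price) AS revenue FROM flat_items GROUP BY sku").show()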

Re: Python Spark for full fledged ETL

2017-06-29 Thread Matt Deaver
While you could do this in Spark, it stinks of over-engineering. An ETL tool would be more appropriate, and if budget is an issue you could look at alternatives like Pentaho or Talend.

Re: community feedback on RedShift with Spark

2017-04-24 Thread Matt Deaver
Redshift COPY is immensely faster than issuing INSERT statements. I did some rough testing of loading data with both INSERT and COPY, and COPY is vastly superior, to the point that if speed is at all a concern for your process you shouldn't even consider INSERT.
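One way to exercise COPY from a Spark pipeline is to land the DataFrame on S3 and then issue a single COPY against the cluster. A rough sketch, with hypothetical bucket, table, IAM role, and connection details:

import psycopg2

# "df" is the DataFrame produced by the upstream Spark job.
# Step 1: write it to S3 as gzipped CSV.
(df.write
   .mode("overwrite")
   .option("compression", "gzip")
   .csv("s3://my-bucket/staging/orders/"))

# Step 2: load it into Redshift with a single COPY instead of row-by-row INSERTs.
conn = psycopg2.connect(host="my-cluster.redshift.amazonaws.com",
                        port=5439, dbname="analytics",
                        user="loader", password="...")
with conn, conn.cursor() as cur:
    cur.execute("""
        COPY analytics.orders
        FROM 's3://my-bucket/staging/orders/'
        IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-copy'
        CSV GZIP;
    """)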

Re: [Spark-SQL] : Incremental load in Pyspark

2017-04-11 Thread Matt Deaver
Do you have updates coming in on your data flow? If so, you will need a staging table and a merge process into your Teradata tables. If you do not have updated rows, i.e. your Teradata tables are append-only, you can process the data and insert (bulk load) it into Teradata directly.
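A rough sketch of both paths, with hypothetical connection details and table names (the Teradata JDBC driver is assumed to be available on the classpath):

# "df" is the processed incremental DataFrame.
jdbc_opts = {
    "url": "jdbc:teradata://td-host/DATABASE=analytics",
    "driver": "com.teradata.jdbc.TeraDriver",
    "user": "loader",
    "password": "...",
}

# Append-only case: bulk-load straight into the target table over JDBC.
(df.write.format("jdbc")
   .options(**jdbc_opts)
   .option("dbtable", "analytics.orders")
   .mode("append")
   .save())

# Update case: land the batch in a staging table instead, then run a MERGE
# inside Teradata (e.g. from your scheduler or a SQL step), roughly:
#   MERGE INTO analytics.orders AS t
#   USING analytics.orders_stage AS s ON t.order_id = s.order_id
#   WHEN MATCHED THEN UPDATE SET ...
#   WHEN NOT MATCHED THEN INSERT ...;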

Best way to deal with skewed partition sizes

2017-03-22 Thread Matt Deaver
For various reasons, our data set is partitioned in Spark by customer id and saved to S3. When trying to read this data back, however, the larger partitions make it difficult to parallelize jobs: out of a couple thousand companies, some have <10 MB of data while others have >10 GB.
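One common way to restore parallelism after such a read is to repartition, optionally salting the customer id so a single huge customer is split across several tasks; a sketch with illustrative partition and salt counts:

from pyspark.sql.functions import col, floor, rand

# "spark" is the active SparkSession; the path is illustrative.
df = spark.read.parquet("s3://my-bucket/data/")  # partitioned by customer_id

# Option A: plain repartition to spread records evenly across tasks.
evened = df.repartition(400)

# Option B: salt the customer id so one 10 GB customer no longer maps to a
# single partition, while small customers still group together.
salted = (df.withColumn("salt", floor(rand() * 16))
            .repartition(400, col("customer_id"), col("salt")))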

Re: Spark streaming to kafka exactly once

2017-03-22 Thread Matt Deaver
You have to handle de-duplication either upstream or downstream. It might technically be possible to handle this in Spark, but you'll probably have a better time handling duplicates in the service that reads from Kafka.
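If deduplication does end up on the Spark side of the consumer, one simple batch-style approach is to drop duplicates on a stable message key; a sketch assuming a hypothetical "event_id" column on the DataFrame "df":

# Keep one row per message key before writing downstream.
deduped = df.dropDuplicates(["event_id"])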

Re: Merging Schema while reading Parquet files

2017-03-21 Thread Matt Deaver
You could create a one-time job that reprocesses the historical data to match the updated format.
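A sketch of such a one-time backfill, assuming hypothetical paths, a partition column "col1", and a newly added string column "col2":

from pyspark.sql.functions import lit
from pyspark.sql.types import StringType

# Read the historical output, add the column the new schema expects, and
# rewrite it so old and new data share one schema. Names are illustrative.
old = spark.read.parquet("s3://my-bucket/job_a/historical/")
migrated = old.withColumn("col2", lit(None).cast(StringType()))

(migrated.write
    .mode("overwrite")
    .partitionBy("col1")
    .parquet("s3://my-bucket/job_a/migrated/"))

# Until the backfill runs, spark.read.option("mergeSchema", "true") can also
# reconcile the two layouts at read time, at some extra cost.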

Recombining output files in parallel

2017-03-20 Thread Matt Deaver
I have a Spark job that processes incremental data and partitions it by customer id. Some customers have very little data, and I have another job that takes a previous period's data and combines it. However, that job runs serially, and I'd basically like to run the combine function on every partition in parallel.
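One pattern that may fit here: the Spark scheduler accepts jobs submitted concurrently from separate driver threads, so the per-customer compaction can be driven from a thread pool. A sketch with hypothetical paths and sizes:

from concurrent.futures import ThreadPoolExecutor

# List the customer partitions present in the incremental output.
customer_ids = [r.customer_id for r in
                spark.read.parquet("s3://my-bucket/incremental/")
                     .select("customer_id").distinct().collect()]

def compact(cid):
    path = "s3://my-bucket/incremental/customer_id={}/".format(cid)
    (spark.read.parquet(path)
          .coalesce(1)  # small customers collapse to a single output file
          .write.mode("overwrite")
          .parquet("s3://my-bucket/combined/customer_id={}/".format(cid)))

# Each submitted call launches its own Spark job, so several partitions are
# rewritten at once instead of serially.
with ThreadPoolExecutor(max_workers=8) as pool:
    list(pool.map(compact, customer_ids))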