I would love to be told otherwise, but I believe your options are to either
1) use the explode function or 2) pre-process the data so you don't have to
explode it.
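A minimal sketch of option 2, flattening nested records in plain Python before they ever reach Spark — the record shape and key names here are hypothetical; list fields get one output row per element, much like explode:

```python
def flatten(record, prefix=""):
    """Recursively flatten a nested dict; list fields are expanded
    into one output row per element (explode-style)."""
    rows = [{}]
    for key, value in record.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            # cross-join the sub-rows of the nested struct into the current rows
            sub_rows = flatten(value, prefix=name + ".")
            rows = [dict(r, **s) for r in rows for s in sub_rows]
        elif isinstance(value, list):
            # one output row per list element
            rows = [dict(r, **{name: v}) for r in rows for v in value]
        else:
            for r in rows:
                r[name] = value
    return rows

# hypothetical nested JSON record
rec = {"id": 1, "user": {"name": "a"}, "tags": ["x", "y"]}
flat = flatten(rec)
```

Running this as a pre-processing step means the data lands in a flat, columnar-friendly shape and the Spark job never needs explode at all.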
On Jul 15, 2017 11:41 AM, "Patrick" wrote:
> Hi,
>
> We need to query a deeply nested JSON structure. However
While you could do this in Spark, it stinks of over-engineering. An ETL tool
would be more appropriate, and if budget is an issue you could look at
alternatives like Pentaho or Talend.
On Thu, Jun 29, 2017 at 8:48 PM, wrote:
> Hi,
>
> One more thing - I am talking about
Redshift COPY is immensely faster than trying to do INSERT statements. I
did some rough testing of loading data with INSERT versus COPY, and COPY is
vastly superior, to the point that if speed is at all an issue for your
process you shouldn't even consider using INSERT.
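To illustrate the difference in shape (the table, bucket, and role names below are hypothetical), the slow path issues one statement per row, while the fast path stages the rows as a file on S3 and loads them with a single COPY:

```python
rows = [(1, "a"), (2, "b"), (3, "c")]

# slow path: one INSERT statement (and one round-trip) per row
inserts = [
    f"INSERT INTO events (id, val) VALUES ({i}, '{v}');" for i, v in rows
]

# fast path: write the rows to a CSV on S3, then load them all at once
copy_stmt = (
    "COPY events FROM 's3://my-bucket/events.csv' "
    "IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-load' "
    "FORMAT AS CSV;"
)
```

COPY also parallelizes across the cluster's slices, which is where most of the speedup comes from on real data volumes.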
On Mon, Apr 24, 2017 at 11:07
should include in the query.
>
> Thanks
>
> On Tue, Apr 11, 2017 at 2:59 PM Matt Deaver <mattrdea...@gmail.com> wrote:
>
>> Do you have updates coming in on your data flow? If so, you will need a
>> staging table and a merge process into your Teradata tables.
Do you have updates coming in on your data flow? If so, you will need a
staging table and a merge process into your Teradata tables.
If you do not have updated rows (i.e., your Teradata tables are append-only),
you can process the data and insert (bulk load) it into Teradata.
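The staging-and-merge path can be sketched as the SQL a loader script might emit — the table and column names here are hypothetical, and Teradata's actual MERGE syntax should be checked against its documentation:

```python
staging = "stg_orders"   # hypothetical staging table, bulk-loaded first
target = "orders"        # hypothetical target table

# upsert staged rows into the target: update matches, insert the rest
merge_sql = f"""
MERGE INTO {target} AS t
USING {staging} AS s
  ON t.order_id = s.order_id
WHEN MATCHED THEN
  UPDATE SET amount = s.amount, updated_at = s.updated_at
WHEN NOT MATCHED THEN
  INSERT (order_id, amount, updated_at)
  VALUES (s.order_id, s.amount, s.updated_at);
"""
```

The point of the staging table is that the bulk load stays fast and append-only, and only the final MERGE touches the real table.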
I don't have experience doing
For various reasons, our data set is partitioned in Spark by customer id
and saved to S3. When trying to read this data, however, the larger
partitions make it difficult to parallelize jobs. For example, out of a
couple thousand companies, some have <10 MB of data while others have >10 GB.
This is the
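One way to even out that skew at read time is to pack customers into roughly equal-sized groups before launching tasks, so one task reads one group rather than one customer. A minimal greedy sketch in plain Python (the customer sizes are hypothetical):

```python
def bucket_by_size(sizes_mb, target_mb=1024):
    """Greedily pack (customer_id, size_mb) pairs into buckets of
    roughly target_mb each; sort descending so big customers are
    placed first. Customers larger than target_mb get their own bucket."""
    buckets, current, current_size = [], [], 0
    for cust, size in sorted(sizes_mb.items(), key=lambda kv: -kv[1]):
        if current and current_size + size > target_mb:
            buckets.append(current)
            current, current_size = [], 0
        current.append(cust)
        current_size += size
    if current:
        buckets.append(current)
    return buckets

# hypothetical per-customer data sizes in MB
sizes = {"a": 900, "b": 800, "c": 200, "d": 100, "e": 50}
groups = bucket_by_size(sizes, target_mb=1024)
```

Each resulting group becomes one unit of read parallelism, so the tiny customers stop producing thousands of near-empty tasks while the big ones still get a task of their own.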
You have to handle de-duplication upstream or downstream. It might
technically be possible to handle this in Spark but you'll probably have a
better time handling duplicates in the service that reads from Kafka.
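A common downstream approach is for the consumer to keep only the latest record per key as it reads from Kafka; a minimal in-memory sketch (the message shape and keys are hypothetical):

```python
def dedupe_latest(messages):
    """Keep the last-seen value per key, relying on Kafka's
    per-partition ordering (messages arrive in offset order)."""
    latest = {}
    for msg in messages:
        latest[msg["key"]] = msg["value"]
    return latest

msgs = [
    {"key": "user:1", "value": 10},
    {"key": "user:2", "value": 20},
    {"key": "user:1", "value": 30},  # duplicate key: overwrites the earlier value
]
deduped = dedupe_latest(msgs)
```

In a real consumer the dict would be a persistent store keyed the same way, but the idempotent-overwrite idea is identical.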
On Wed, Mar 22, 2017 at 1:49 PM, Maurin Lenglart
wrote:
>
You could create a one-time job that processes historical data to match the
updated format.
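A sketch of such a one-time backfill in plain Python — the field names and the rename are hypothetical; the point is that every historical record is rewritten once into the new schema, and the regular incremental job handles everything after:

```python
def migrate(old_record):
    """Rewrite a historical record into the updated format
    (hypothetical change: rename 'cust' -> 'customer_id' and
    stamp the records with a schema version)."""
    new_record = dict(old_record)
    new_record["customer_id"] = new_record.pop("cust")
    new_record["schema_version"] = 2
    return new_record

# hypothetical old-format records
historical = [{"cust": 42, "amount": 9.5}]
migrated = [migrate(r) for r in historical]
```

Once the backfill has run, both the old and new data share one layout, so downstream readers never need to branch on format.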
On Tue, Mar 21, 2017 at 8:53 AM, Aditya Borde wrote:
> Hello,
>
> I'm currently blocked with this issue:
>
> I have job "A" whose output is partitioned by one of the fields - "col1"
>
I have a Spark job that processes incremental data and partitions it by
customer id. Some customers have very little data, and I have another job
that takes a previous period's data and combines it. However, the job runs
serially and I'd basically like to run the function on every partition
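One way to break that serialization is to fan the per-partition function out over a thread pool; a minimal sketch with the standard library, where the customer ids and the work function are hypothetical stand-ins for the real per-customer job:

```python
from concurrent.futures import ThreadPoolExecutor

def combine_partition(customer_id):
    """Stand-in for the real combine work done per customer partition."""
    return customer_id, f"combined-{customer_id}"

customer_ids = ["c1", "c2", "c3", "c4"]

# run the per-partition function concurrently instead of one at a time
with ThreadPoolExecutor(max_workers=4) as pool:
    results = dict(pool.map(combine_partition, customer_ids))
```

The same pattern works from a Spark driver, since job submission is thread-safe: each thread triggers its own action and the scheduler interleaves the jobs across the cluster.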