To follow up:
I asked the developer how we incrementally load data, and the response was:

No. A union is done only for the updated records (every night).
For the every-minute export, the algorithm is as follows (a rough sketch of the corresponding statements appears after the list):
1. upload file to hadoop.
2. load data inpath... overwrite into table ...._incremental;
3. insert into table ..._cached from ..._incremental
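To make the steps concrete, here is a rough sketch of what steps 2 and 3 look like as Hive/Shark statements. The names are placeholders: customers_cached is the cached table from the query further down, while customers_incremental and the HDFS path are purely illustrative, so this is not the exact script we run:

    -- step 2: replace the contents of the staging (incremental) table
    -- with the file that was uploaded to HDFS in step 1
    LOAD DATA INPATH '/staging/customers/export.csv'
      OVERWRITE INTO TABLE customers_incremental;

    -- step 3: append the new rows onto the cached table that queries hit
    INSERT INTO TABLE customers_cached
      SELECT * FROM customers_incremental;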

Perhaps this helps in understanding our issue.

On Thursday, November 20, 2014, Gordon Benjamin <gordon.benjami...@gmail.com>
wrote:

> Hi,
>
> We are seeing bad performance as we incrementally load data. Here is the
> config
>
> Spark standalone cluster
>
> spark01 (spark master, shark, hadoop namenode): 15GB RAM, 4vCPU's
> spark02 (spark worker, hadoop datanode): 15GB RAM, 8vCPU's
> spark03 (spark worker): 15GB RAM, 8vCPU's
> spark04 (spark worker): 15GB RAM, 8vCPU's
>
> spark worker configuration:
> spark.local.dir=/path/to/ssd/disk
> spark.default.parallelism=64
> spark.executor.memory=10g
> spark.serializer=org.apache.spark.serializer.KryoSerializer
>
> shark configuration:
> spark.kryoserializer.buffer.mb=64
> mapred.reduce.tasks=30
> spark.scheduler.mode=FAIR
> spark.serializer=org.apache.spark.serializer.KryoSerializer
> spark.default.parallelism=64
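> If it helps narrow things down, these properties can also be checked or overridden per
> session from the Shark CLI with SET (the lines below are only examples, not part of our
> permanent configuration):
>
>   -- print the current value of a property
>   SET spark.default.parallelism;
>   -- override a property for the current session only
>   SET mapred.reduce.tasks=30;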
>
> and performance degrades as more data is loaded into Spark.
>
> A simple query like this:
> select count(*) from customers_cached
> took 0.5 seconds on 12th Nov
> and takes 4.24 seconds now.
>
> We also see these warnings all over the logs:
>
> 2014-11-20 16:56:42,125 WARN  parse.TypeCheckProcFactory
> (TypeCheckProcFactory.java:convert(180)) - Invalid type entry TOK_INT=null
> 2014-11-20 16:56:51,988 WARN  parse.TypeCheckProcFactory
> (TypeCheckProcFactory.java:convert(180)) - Invalid type entry
> TOK_TABLE_OR_COL=null
>
> Does anyone have any ideas to help us resolve this? We can post anything else you need.
>
