Re: [SQL] Using HashPartitioner to distribute by column

2015-01-21 Thread Michael Davies
Hi Cheng, are you saying that, with the lineage schemaRdd.keyBy(_.getString(1)).partitionBy(new HashPartitioner(n)).values.applySchema(schema) set up, Spark SQL will know that a SQL “group by” on Customer Code will not have to shuffle? But the prepared will have already shuffled so we
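
To make the partitionBy(new HashPartitioner(n)) step in that lineage concrete, here is a minimal plain-Scala sketch with no Spark dependency. It assumes the usual hash-partitioning rule (the key's hashCode taken modulo n, kept non-negative) and shows only that rows sharing a customer code end up in the same partition. The object name, the (orderId, customerCode) row shape and the sample values are hypothetical, and the sketch says nothing about whether Spark SQL's planner actually uses that layout to skip a shuffle.

```scala
// Plain Scala sketch (no Spark): hash partitioning of rows by a key column.
object HashPartitionSketch {
  // Hypothetical row shape: (orderId, customerCode).
  type Row = (String, String)

  // Modulo that never returns a negative index, even for negative hash codes.
  def nonNegativeMod(x: Int, mod: Int): Int = {
    val raw = x % mod
    if (raw < 0) raw + mod else raw
  }

  // Partition index for a key: its hashCode modulo the partition count.
  // A null key is sent to partition 0.
  def partitionFor(key: Any, numPartitions: Int): Int =
    if (key == null) 0 else nonNegativeMod(key.hashCode, numPartitions)

  // Bucket rows by the partition index of their customer code.
  def partitionByKey(rows: Seq[Row], numPartitions: Int): Map[Int, Seq[Row]] =
    rows.groupBy(row => partitionFor(row._2, numPartitions))

  def main(args: Array[String]): Unit = {
    val rows = Seq(("o1", "C001"), ("o2", "C002"), ("o3", "C001"), ("o4", "C003"))
    val parts = partitionByKey(rows, 3)

    // Count how many distinct partitions each customer code appears in.
    val partitionsPerCode = parts.toSeq
      .flatMap { case (index, bucket) => bucket.map(row => (row._2, index)) }
      .distinct
      .groupBy(_._1)
      .map { case (code, pairs) => code -> pairs.size }

    // Each customer code lives in exactly one partition.
    println(partitionsPerCode.forall(_._2 == 1))
  }
}
```

Because every row for a given customer code sits in one partition, a per-key aggregate could in principle be computed partition by partition; the cost of getting there is the one shuffle that the partitioning step itself performs.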

Re: Mapping directory structure to columns in SparkSQL

2015-01-09 Thread Michael Davies
to discuss? Thanks, Mick
On 30 Dec 2014, at 17:40, Michael Davies michael.belldav...@gmail.com wrote:
Hi Michael, I’ve looked through the example and the test cases and I think I understand what we need to do - so I’ll give it a go. I think what I’d like to try to do is allow files

Re: Mapping directory structure to columns in SparkSQL

2014-12-30 Thread Michael Davies
Hi Michael, I’ve looked through the example and the test cases and I think I understand what we need to do - so I’ll give it a go. I think what I’d like to try to do is allow files to be added at any time, so perhaps I can cache partition info, and also what may be useful for us would be to