Oh, lame. I can imagine why that would happen: by design, Shard.shard takes a PCollection<T>, so it's mucking with the PTable keys here when it does the random distribution. We can add a PTable-specific version of Shard.shard without too much trouble. Would you mind filing a JIRA? https://issues.apache.org/jira/browse/CRUNCH
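
For reference, here is a rough sketch of what a pair-aware shard would need to do: pick a random shard number for each (key, value) pair and pass the pair through untouched. This is plain Java with no Crunch types. The class PairShardSketch, the helper shardPairs, and the seed parameter are all made up for illustration and are not part of Crunch, and a real fix inside Shard may well look different.

```java
import java.util.AbstractMap.SimpleImmutableEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Random;

public class PairShardSketch {

    // Hypothetical helper, not a Crunch API: spreads (key, value) pairs over
    // numShards buckets at random, treating each pair as one opaque element.
    public static <K, V> List<List<Map.Entry<K, V>>> shardPairs(
            List<Map.Entry<K, V>> pairs, int numShards, long seed) {
        List<List<Map.Entry<K, V>>> shards = new ArrayList<>();
        for (int i = 0; i < numShards; i++) {
            shards.add(new ArrayList<>());
        }
        Random random = new Random(seed);
        for (Map.Entry<K, V> pair : pairs) {
            // The random number only selects the bucket; the key and value
            // are carried over into a fresh pair without modification.
            Map.Entry<K, V> carried =
                    new SimpleImmutableEntry<>(pair.getKey(), pair.getValue());
            shards.get(random.nextInt(numShards)).add(carried);
        }
        return shards;
    }

    public static void main(String[] args) {
        // Sample rows taken from the "normal" output quoted in this thread.
        List<Map.Entry<String, Integer>> table = new ArrayList<>();
        table.add(new SimpleImmutableEntry<>("3.0.1.1.2.CH22_RELEASE", 1622));
        table.add(new SimpleImmutableEntry<>("3.0.1.1.2.CH23_RELEASE", 10607));
        table.add(new SimpleImmutableEntry<>("3.0.2.1.2.CH11_RELEASE", 3));

        List<List<Map.Entry<String, Integer>>> shards = shardPairs(table, 10, 42L);
        for (int i = 0; i < shards.size(); i++) {
            for (Map.Entry<String, Integer> pair : shards.get(i)) {
                System.out.println("shard " + i + ": " + pair.getKey() + " " + pair.getValue());
            }
        }
    }
}
```

The property that matters is that every key comes out of the shard step exactly as it went in, whichever bucket it lands in.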
On Thu, Nov 17, 2016 at 3:48 AM, wu lihu <[email protected]> wrote:
> Hi Everyone
> I have a job to work with parquet file output,
> Shard.shard(outTable,10).write(new
> AvroParquetFileTarget(tempOut+path), Target.WriteMode.OVERWRITE);
>
> However, the output looks like below
> 3.0.3.1.2.CH24_RELEASE 2
> 3.0.3.1.2.CH24_RELEASEE 1
> 3.0.3.1.2.CH24_RELEASEEA 1
> 3.0.3.1.2.CH24_RELEASEEAS 1
> 3.0.3.1.2.CH24_RELEASEEASE 29
> 3.0.3.1.2.CH24_RELEASEEASES 160
> 3.0.3.1.2.CH24_RELEASEEASESE 85
> 3.0.3.1.2.CH24_RELEASEEASESEE 14
> 3.0.3.1.2.CH24_RELEASEEASESEEE 4
> 3.0.3.1.2.CH24_RELEASEEASESEEES 1
> there is extra suffix added to the key of the PTable, all of them
> should be RELEASE but not the RELEASEEASE bra bra
>
> If I remove the Shard, and keeps all the same, the output looks like
> normal
> 3.0.0.1.2.CH.1.4_RELEASE 1
> 3.0.1.1.2.CH22_RELEASE 1622
> 3.0.1.1.2.CH23_RELEASE 10607
> 3.0.14.1.2.CH.1.3_RELEASE 18080
> 3.0.19.1.2.TC21_RELEASE 5
> 3.0.2.1.2.CH11_RELEASE 3
> 3.0.2.1.2.TC21_RELEASE 4
> 3.0.20.1.2.TC21_RELEASE 247
> 3.0.20.7.2.SX.1.2A_RELEASE 2
> 3.0.20.8.2.SX.1.3A_RELEASE 1
>
>
> Any thoughts ???
>
