I was looking at the PCollectionImpl.by method[0] today, and I think that the 
ExtractKeyFn[1] it uses may not be calculating scaleFactor correctly.  
ExtractKeyFn inherits the default scaleFactor for a MapFn (1.0), but since it 
emits the extracted key alongside the original value, shouldn't it have a 
scaleFactor of 1 + the input MapFn's scaleFactor?

As an example, if you have a PCollection<T> and you call by with the IdentityFn, 
the returned table should have a size of 2 * the original collection's size, but 
as it stands now, it will be estimated at the same size as the original.
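
To make the arithmetic concrete, here is a minimal standalone sketch. It does 
not use Crunch itself: SizeHint, currentExtractKeyScale, and 
proposedExtractKeyScale are hypothetical names that only model the scaleFactor 
hint, and the "1 + input scaleFactor" rule is the proposal above, not 
confirmed behavior.

```java
public class ScaleFactorSketch {
    /** Hypothetical stand-in for a MapFn; only the size hint is modeled. */
    public interface SizeHint {
        float scaleFactor();
    }

    /** Models the current behavior described above: the MapFn default of 1.0. */
    public static float currentExtractKeyScale(SizeHint keyFn) {
        return 1.0f;
    }

    /** Models the proposal: the original value (1.0) plus the extracted key. */
    public static float proposedExtractKeyScale(SizeHint keyFn) {
        return 1.0f + keyFn.scaleFactor();
    }

    /** Estimated output size for a given input size and scale factor. */
    public static long estimatedSize(long inputSize, float scale) {
        return (long) (inputSize * (double) scale);
    }

    public static void main(String[] args) {
        // An identity key function copies its input, so its scale factor is 1.0.
        SizeHint identity = () -> 1.0f;
        long inputSize = 1_000_000L;

        System.out.println("current estimate:  "
            + estimatedSize(inputSize, currentExtractKeyScale(identity)));
        System.out.println("proposed estimate: "
            + estimatedSize(inputSize, proposedExtractKeyScale(identity)));
    }
}
```

With an identity key function, the proposed rule doubles the estimate while 
the current one leaves it unchanged.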

Assuming we later group a table that we constructed with by, won't we use 
(potentially) far fewer reducers than we actually should be?
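
A rough sketch of why the estimate matters for the group step, assuming the 
reducer count is derived by dividing the estimated table size by a 
per-reducer byte target. The reducersFor helper and the 1 GB target are 
illustrative assumptions, not taken from the Crunch source.

```java
public class ReducerEstimateSketch {
    /** Illustrative per-reducer target; an assumption, not a Crunch constant. */
    public static final long BYTES_PER_REDUCER = 1_000_000_000L;

    /** Ceiling division of the estimated size by the target, at least 1. */
    public static int reducersFor(long estimatedBytes, long bytesPerReducer) {
        long n = (estimatedBytes + bytesPerReducer - 1) / bytesPerReducer;
        return (int) Math.max(1L, n);
    }

    public static void main(String[] args) {
        long inputBytes = 10_000_000_000L;

        // Scale factor 1.0: the table is assumed to be as large as its input.
        long underestimated = inputBytes;
        // Scale factor 2.0: key plus value for an identity key function.
        long proposed = 2 * inputBytes;

        System.out.println("reducers with scale 1.0: "
            + reducersFor(underestimated, BYTES_PER_REDUCER));
        System.out.println("reducers with scale 2.0: "
            + reducersFor(proposed, BYTES_PER_REDUCER));
    }
}
```

Under these assumptions, halving the size estimate halves the number of 
reducers chosen for the shuffle.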

[0]: 
https://github.com/apache/crunch/blob/master/crunch-core/src/main/java/org/apache/crunch/impl/dist/collect/PCollectionImpl.java#L270
[1]: 
https://github.com/apache/crunch/blob/master/crunch-core/src/main/java/org/apache/crunch/fn/ExtractKeyFn.java
