Yes, you're right. Could you file a JIRA for it?

J
On Thu, May 21, 2015 at 10:48 AM, Patel,Stephen <[email protected]> wrote:

> I was looking at the PCollectionImpl.by method[0] today, and I think
> that the ExtractKeyFn[1] it's using may not be calculating scaleFactor
> correctly. The ExtractKeyFn is using the default scaleFactor for a MapFn
> (1.0), but shouldn't it have a scaleFactor of 1 + the input MapFn's
> scaleFactor?
>
> As an example, if you had a PCollection<T> and you call by with the
> IdentityFn, the returned table should have a size of 2 * the original
> collection's size, but as it stands now, it will have the same size as
> the original.
>
> Assuming we later group a table that we constructed with by, won't we
> use (potentially) far fewer reducers than we actually should be?
>
> [0]:
> https://github.com/apache/crunch/blob/master/crunch-core/src/main/java/org/apache/crunch/impl/dist/collect/PCollectionImpl.java#L270
> [1]:
> https://github.com/apache/crunch/blob/master/crunch-core/src/main/java/org/apache/crunch/fn/ExtractKeyFn.java
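
To make the arithmetic in the quoted message concrete, here is a minimal standalone sketch. It is not Crunch code and does not reproduce ExtractKeyFn. The class and method names (ScaleFactorSketch, proposedExtractKeyScaleFactor, estimateSize) are hypothetical, and it assumes that an estimated output size is simply the input size multiplied by the function's scaleFactor.

```java
// Hypothetical, Crunch-free sketch of the size arithmetic discussed above.
// None of these names exist in Crunch; they are for illustration only.
final class ScaleFactorSketch {

    // Default scaleFactor of a MapFn, as stated in the quoted message.
    static final float DEFAULT_MAP_FN_SCALE_FACTOR = 1.0f;

    // Proposed scaleFactor for the key-extracting function: every output
    // pair carries the original value (1x) plus the key produced by the
    // input MapFn (mapFnScaleFactor x).
    static float proposedExtractKeyScaleFactor(float mapFnScaleFactor) {
        return 1.0f + mapFnScaleFactor;
    }

    // Assumption: estimated output size = input size * scaleFactor.
    static long estimateSize(long inputSize, float scaleFactor) {
        return (long) (inputSize * (double) scaleFactor);
    }

    public static void main(String[] args) {
        long inputSize = 1_000_000L; // assumed input size, in bytes

        // Current behavior described in the message: scaleFactor stays at 1.0.
        long current = estimateSize(inputSize, DEFAULT_MAP_FN_SCALE_FACTOR);

        // Proposed behavior for an identity-like MapFn: 1 + 1.0 = 2.0.
        long proposed = estimateSize(
            inputSize,
            proposedExtractKeyScaleFactor(DEFAULT_MAP_FN_SCALE_FACTOR));

        System.out.println("current estimate:  " + current);
        System.out.println("proposed estimate: " + proposed);
    }
}
```

Under these assumptions, the estimate for a table keyed with an identity-like function doubles, which is the gap that would matter if the number of reducers for a later grouping is derived from the estimated size.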
