JIRA filed, see: https://issues.apache.org/jira/browse/PIG-2014
--jacob
@thedatachef

On Mon, 2011-04-25 at 09:02 -0700, Alan Gates wrote:
> You are not insane. Pig rewrites sample into filter, and then pushes
> that filter in front of the group. It shouldn't push that filter
> since the UDF is non-deterministic. If you add "-t PushUpFilter" to
> your command line when invoking pig this won't happen. Could you file
> a JIRA for this so we keep track of it?
>
> Alan.
>
> On Apr 24, 2011, at 10:41 AM, Jacob Perkins wrote:
>
> > So I'm running into something strange. Consider the following code:
> >
> > tfidf_all = LOAD '$TFIDF' AS (doc_id:chararray, token:chararray,
> >     weight:double);
> > grouped = GROUP tfidf_all BY doc_id;
> > vectors = FOREACH grouped GENERATE group AS doc_id,
> >     tfidf_all.(token, weight) AS vector;
> > DUMP vectors;
> >
> > This, of course, runs just fine. tfidf_all contains 1,428,280 records.
> > The reduce output records should be exactly the number of documents,
> > which turn out to be 18,863 in this case. All well and good.
> >
> > The strangeness comes when I add a SAMPLE command:
> >
> > sampled = SAMPLE vectors 0.0012;
> > DUMP sampled;
> >
> > Running this results in 1,513 reduce output records. So, am I insane
> > or shouldn't the reduce output records be much, much closer to 22 or
> > 23 records (e.g. 0.0012 * 18,863)?
> >
> > --jacob
> > @thedatachef
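
Alan's explanation accounts for the numbers in the thread: pushing the sample's filter in front of the GROUP means the 0.0012 sample is applied to the 1,428,280 input rows (~1,714 rows survive) rather than to the 18,863 groups (~23 would survive), and the surviving rows collapse into roughly 1,500 distinct doc_ids. A rough sketch of that arithmetic, as a hypothetical Python simulation with synthetic data (not Pig itself; the counts and rate are taken from the thread, everything else is made up):

```python
import random

# Counts from the thread; the doc_id assignments below are synthetic.
N_ROWS = 1_428_280   # records in tfidf_all
N_DOCS = 18_863      # distinct doc_ids (groups)
P = 0.0012           # SAMPLE rate

random.seed(42)

# Each row belongs to a random document (~76 rows per doc on average).
rows = [random.randrange(N_DOCS) for _ in range(N_ROWS)]

# Intended plan: GROUP first, then SAMPLE the grouped relation.
groups = set(rows)
sampled_groups = [g for g in groups if random.random() < P]

# Buggy plan (filter pushed above the group): SAMPLE the raw rows,
# then group whatever rows survive.
sampled_rows = [r for r in rows if random.random() < P]
groups_from_sampled_rows = set(sampled_rows)

print(len(sampled_groups))            # on the order of 23 (0.0012 * 18,863)
print(len(groups_from_sampled_rows))  # on the order of 1,500, like the 1,513 observed
```

The second count lands near 1,500 rather than exactly 0.0012 * 1,428,280 ≈ 1,714 because some of the sampled rows share a doc_id and collapse into the same group.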
