I've done this in two passes. First I do an intersection test and determine the outer misses by join key on each side, similar to what you've done. I then store the left_only and right_only sides for further inspection.
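For what it's worth, that first pass boils down to set operations on the join keys. Here is a minimal sketch in plain Java (the Map-based representation and all names are illustrative; in Pig this is the FULL OUTER JOIN plus null filters):

```java
import java.util.*;

public class OuterMissSplit {
    // Splits two keyed datasets into left-only, right-only, and
    // intersecting keys: the same partitioning a FULL OUTER JOIN
    // on the key gives you in Pig.
    public static Map<String, Set<String>> split(Map<String, String> left,
                                                 Map<String, String> right) {
        Set<String> leftOnly = new TreeSet<>(left.keySet());
        leftOnly.removeAll(right.keySet());

        Set<String> rightOnly = new TreeSet<>(right.keySet());
        rightOnly.removeAll(left.keySet());

        Set<String> both = new TreeSet<>(left.keySet());
        both.retainAll(right.keySet());

        Map<String, Set<String>> out = new LinkedHashMap<>();
        out.put("left_only", leftOnly);
        out.put("right_only", rightOnly);
        out.put("intersection", both);
        return out;
    }

    public static void main(String[] args) {
        Map<String, String> l = Map.of("1", "a", "2", "b", "3", "c");
        Map<String, String> r = Map.of("2", "b", "3", "x", "4", "d");
        // prints {left_only=[1], right_only=[4], intersection=[2, 3]}
        System.out.println(split(l, r));
    }
}
```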
Then I take the intersection relation, which contains a left and right tuple, and pass it through a UDF. This is similar to your #3 proposal, except the UDF takes two tuples. It traverses them in parallel and outputs a string representation of a bitmask showing which tuple fields matched or missed. Group on the bitmasks to generate counts and you get a report of all the different combinations of field misses. All without a known schema.

On Fri, Nov 30, 2012 at 12:49 PM, Ruslan Al-Fakikh <[email protected]> wrote:

> Hi,
>
> As for point 1: it will always be cumbersome to work on such files. I
> would recommend using Avro, where the schema is included in the file.
> You could also try to sort the contents or apply some transformation to
> force the files to look the same, then just diff the files outside of
> Pig. That's just an idea; I'm not sure whether it'll work for you.
>
> Thanks
>
>
> On Fri, Nov 30, 2012 at 5:48 PM, Clément MATHIEU <[email protected]> wrote:
>
> > Hi all,
> >
> > I'm trying to build a non-regression testing tool to verify that the
> > files produced by two Pig scripts are equal.
> >
> > The files are in PigStorage format. The first field is a key and the
> > remaining fields are opaque data (primitive or complex types).
> >
> > Example:
> > 1  43  {(10), (12), (14)}  {(55), (90)}  0  60
> >
> > I want to check that each key is present in both files or neither, and
> > that for each key the lines are equal. By equal I mean logical
> > equality, not string or byte equality. For example, the two following
> > lines should be equal:
> > 1  43  {(10), (12), (14)}  {(55), (90)}  0  60
> > 1  43  {(12), (10), (14)}  {(90), (55)}  0  60
> >
> > My issue is that since this tool needs to operate on lots of different
> > files, it should not rely on a predefined schema.
> > I experimented with the following idea:
> >
> > ------
> > f1 = LOAD '$FILE1' USING PigStorage();
> > f2 = LOAD '$FILE2' USING PigStorage();
> >
> > g_f1 = GROUP f1 BY $0;
> > g_f2 = GROUP f2 BY $0;
> >
> > joined = JOIN
> >     g_f1 BY group FULL OUTER,
> >     g_f2 BY group;
> >
> > cmp = FILTER joined BY
> >     g_f1::group IS NULL
> >     OR g_f2::group IS NULL
> >     OR SIZE(DIFF(g_f1::f1, g_f2::f2)) != 0;
> >
> > DUMP cmp;
> > ------
> >
> > Unfortunately, since no schema is specified at load time, g_f1::f1 and
> > g_f2::f2 are instances of DataByteArray. This means that the DIFF
> > function does not behave as wanted: a byte-to-byte comparison is
> > performed rather than a logical comparison. For example, "1 {(2),(1)}"
> > and "1 {(1),(2)}" are different since their byte representations are
> > not the same.
> >
> > Do you know if such a tool already exists, or how to write one?
> >
> > I currently foresee three options:
> >
> > 1- Specify the schema. This could be done using scripting and a
> >    file-to-schema mapping; the schema would be inserted using a
> >    variable. However, the schema of each file has to be described
> >    manually, which is a cumbersome process.
> > 2- Use PigStorageSchema instead of PigStorage. I believe this would
> >    solve the issue, but being stuck with 0.8.1 I'm wondering whether
> >    PigStorageSchema is robust and side-effect free enough to be used
> >    in production scripts.
> > 3- Write a custom DIFF UDF taking two DataByteArrays. This option
> >    avoids modifying the production scripts, but I don't know how much
> >    effort is required to write such a UDF. Parsing the DataByteArray
> >    to rebuild a set/list/string structure seems quite easy. Do you
> >    think some part of the Pig code, like Utf8StorageConverter, can be
> >    reused, or should I simply write my own parser?
> >
> > Thanks!
> >
> > - Clément

--
*Note that I'm no longer using my Yahoo! email address. Please email me at
[email protected] going forward.*
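To make the bitmask idea at the top of the thread (and Clément's option 3) concrete, here is a minimal sketch of the field-by-field comparison in plain Java, outside Pig's EvalFunc wrapper. It works on the textual PigStorage form and assumes tab-separated fields, with bags compared as unordered multisets of their tuples; a real UDF would traverse Tuple/DataBag objects instead, and all class and method names here are illustrative:

```java
import java.util.*;
import java.util.regex.*;

public class TupleDiff {
    // Compares one field logically: bags ("{...}") match when they hold
    // the same elements regardless of order; anything else is compared
    // as a trimmed string. Nested complex types are not handled here.
    static boolean fieldsMatch(String a, String b) {
        a = a.trim();
        b = b.trim();
        if (a.startsWith("{") && a.endsWith("}")
                && b.startsWith("{") && b.endsWith("}")) {
            return bagElements(a).equals(bagElements(b));
        }
        return a.equals(b);
    }

    // Splits "{(10), (12)}" into a sorted multiset of its tuples.
    static List<String> bagElements(String bag) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile("\\([^)]*\\)").matcher(bag);
        while (m.find()) out.add(m.group());
        Collections.sort(out);
        return out;
    }

    // Builds the match/miss bitmask for two tab-separated tuples:
    // '1' where the fields agree logically, '0' where they differ.
    // Grouping these masks and counting gives the miss-combination report.
    public static String bitmask(String leftTuple, String rightTuple) {
        String[] l = leftTuple.split("\t");
        String[] r = rightTuple.split("\t");
        int n = Math.max(l.length, r.length);
        StringBuilder mask = new StringBuilder();
        for (int i = 0; i < n; i++) {
            boolean ok = i < l.length && i < r.length && fieldsMatch(l[i], r[i]);
            mask.append(ok ? '1' : '0');
        }
        return mask.toString();
    }

    public static void main(String[] args) {
        // Bags match despite element order; the last field differs.
        String a = "1\t43\t{(10), (12), (14)}\t{(55), (90)}\t0\t60";
        String b = "1\t43\t{(12), (10), (14)}\t{(90), (55)}\t0\t61";
        System.out.println(bitmask(a, b)); // prints 111110
    }
}
```

Two fully matching lines, like the pair in Clément's example, would yield an all-ones mask, so filtering out those masks before grouping leaves only the mismatch report.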
