[jira] Commented: (PIG-953) Enable merge join in pig to work with loaders and store functions which can internally index sorted data
[ https://issues.apache.org/jira/browse/PIG-953?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12771188#action_12771188 ] Olga Natkovich commented on PIG-953: +1 on the patch assuming that the all the failures and warning from test-patch are addressed. > Enable merge join in pig to work with loaders and store functions which can > internally index sorted data > - > > Key: PIG-953 > URL: https://issues.apache.org/jira/browse/PIG-953 > Project: Pig > Issue Type: Improvement >Affects Versions: 0.3.0 >Reporter: Pradeep Kamath >Assignee: Pradeep Kamath > Attachments: PIG-953-2.patch, PIG-953-3.patch, PIG-953-4.patch, > PIG-953-5.patch, PIG-953-6.patch, PIG-953-7.patch, PIG-953-8.patch, > PIG-953.patch > > > Currently merge join implementation in pig includes construction of an index > on sorted data and use of that index to seek into the "right input" to > efficiently perform the join operation. Some loaders (notably the zebra > loader) internally implement an index on sorted data and can perform this > seek efficiently using their index. So the use of the index needs to be > abstracted in such a way that when the loader supports indexing, pig uses it > (indirectly through the loader) and does not construct an index. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-953) Enable merge join in pig to work with loaders and store functions which can internally index sorted data
[ https://issues.apache.org/jira/browse/PIG-953?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12771136#action_12771136 ] Hadoop QA commented on PIG-953: --- -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12423385/PIG-953-8.patch against trunk revision 830664. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 6 new or modified tests. -1 javadoc. The javadoc tool appears to have generated 1 warning messages. -1 javac. The applied patch generated 202 javac compiler warnings (more than the trunk's current 197 warnings). -1 findbugs. The patch appears to introduce 5 new Findbugs warnings. -1 release audit. The applied patch generated 320 release audit warnings (more than the trunk's current 313 warnings). -1 core tests. The patch failed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/123/testReport/ Release audit warnings: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/123/artifact/trunk/patchprocess/releaseAuditDiffWarnings.txt Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/123/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Console output: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/123/console This message is automatically generated. > Enable merge join in pig to work with loaders and store functions which can > internally index sorted data > - > > Key: PIG-953 > URL: https://issues.apache.org/jira/browse/PIG-953 > Project: Pig > Issue Type: Improvement >Affects Versions: 0.3.0 >Reporter: Pradeep Kamath >Assignee: Pradeep Kamath > Attachments: PIG-953-2.patch, PIG-953-3.patch, PIG-953-4.patch, > PIG-953-5.patch, PIG-953-6.patch, PIG-953-7.patch, PIG-953-8.patch, > PIG-953.patch > > > Currently merge join implementation in pig includes construction of an index > on sorted data and use of that index to seek into the "right input" to > efficiently perform the join operation. Some loaders (notably the zebra > loader) internally implement an index on sorted data and can perform this > seek efficiently using their index. So the use of the index needs to be > abstracted in such a way that when the loader supports indexing, pig uses it > (indirectly through the loader) and does not construct an index. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-953) Enable merge join in pig to work with loaders and store functions which can internally index sorted data
[ https://issues.apache.org/jira/browse/PIG-953?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12766622#action_12766622 ] Dmitriy V. Ryaboy commented on PIG-953: --- Pradeep, it seems like PigOutputCommiter should extend OutputCommitter rather than FileOutputCommitter. Also -- add this requirement to the StoreFunc redesign proposal? > Enable merge join in pig to work with loaders and store functions which can > internally index sorted data > - > > Key: PIG-953 > URL: https://issues.apache.org/jira/browse/PIG-953 > Project: Pig > Issue Type: Improvement >Affects Versions: 0.3.0 >Reporter: Pradeep Kamath >Assignee: Pradeep Kamath > Attachments: PIG-953-2.patch, PIG-953-3.patch, PIG-953-4.patch, > PIG-953-5.patch, PIG-953.patch > > > Currently merge join implementation in pig includes construction of an index > on sorted data and use of that index to seek into the "right input" to > efficiently perform the join operation. Some loaders (notably the zebra > loader) internally implement an index on sorted data and can perform this > seek efficiently using their index. So the use of the index needs to be > abstracted in such a way that when the loader supports indexing, pig uses it > (indirectly through the loader) and does not construct an index. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-953) Enable merge join in pig to work with loaders and store functions which can internally index sorted data
[ https://issues.apache.org/jira/browse/PIG-953?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12763256#action_12763256 ] Ashutosh Chauhan commented on PIG-953: -- ..aah.. I should have had dug more in jdk sources. AbstractList , which ArrayList extends does override equals and provides correct behavior. So, my comment is a non-issue. With nits taken care of +1 for the patch. > Enable merge join in pig to work with loaders and store functions which can > internally index sorted data > - > > Key: PIG-953 > URL: https://issues.apache.org/jira/browse/PIG-953 > Project: Pig > Issue Type: Improvement >Affects Versions: 0.3.0 >Reporter: Pradeep Kamath >Assignee: Pradeep Kamath > Attachments: PIG-953-2.patch, PIG-953-3.patch, PIG-953.patch > > > Currently merge join implementation in pig includes construction of an index > on sorted data and use of that index to seek into the "right input" to > efficiently perform the join operation. Some loaders (notably the zebra > loader) internally implement an index on sorted data and can perform this > seek efficiently using their index. So the use of the index needs to be > abstracted in such a way that when the loader supports indexing, pig uses it > (indirectly through the loader) and does not construct an index. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-953) Enable merge join in pig to work with loaders and store functions which can internally index sorted data
[ https://issues.apache.org/jira/browse/PIG-953?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12763218#action_12763218 ] Ashutosh Chauhan commented on PIG-953: -- Changes look good. One comment I have: 1) In SortInfo.java#equals We have two lists and we want to check for their equality. I quickly looked up jdk sources and it seems that ArrayList doesn't override equals, so doing equals check on lists would result in reference equality test which would be incorrect. Correct way to do this would be to first check the sizes of two lists, if they are equal iterate through both lists and check equality of items at the same index in two list. Few nits: 1) TestMergeJoin contains a System.err.println which we can get rid of. 2) There are few unused imports in patch. 3) SortInfo.java#getSortColInfoList may result in Findbugs warning because of similar reason we discussed earlier in this jira. > Enable merge join in pig to work with loaders and store functions which can > internally index sorted data > - > > Key: PIG-953 > URL: https://issues.apache.org/jira/browse/PIG-953 > Project: Pig > Issue Type: Improvement >Affects Versions: 0.3.0 >Reporter: Pradeep Kamath >Assignee: Pradeep Kamath > Attachments: PIG-953-2.patch, PIG-953-3.patch, PIG-953.patch > > > Currently merge join implementation in pig includes construction of an index > on sorted data and use of that index to seek into the "right input" to > efficiently perform the join operation. Some loaders (notably the zebra > loader) internally implement an index on sorted data and can perform this > seek efficiently using their index. So the use of the index needs to be > abstracted in such a way that when the loader supports indexing, pig uses it > (indirectly through the loader) and does not construct an index. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-953) Enable merge join in pig to work with loaders and store functions which can internally index sorted data
[ https://issues.apache.org/jira/browse/PIG-953?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12760837#action_12760837 ] Ashutosh Chauhan commented on PIG-953: -- If we are going to re-write that part later then I guess it may not be worth spending more time perfecting the design here (using SortColInfo instead of parallel arraylists)... patch as it is can be committed because we are going to update that part of the code in any case .. > Enable merge join in pig to work with loaders and store functions which can > internally index sorted data > - > > Key: PIG-953 > URL: https://issues.apache.org/jira/browse/PIG-953 > Project: Pig > Issue Type: Improvement >Affects Versions: 0.3.0 >Reporter: Pradeep Kamath >Assignee: Pradeep Kamath > Attachments: PIG-953-2.patch, PIG-953.patch > > > Currently merge join implementation in pig includes construction of an index > on sorted data and use of that index to seek into the "right input" to > efficiently perform the join operation. Some loaders (notably the zebra > loader) internally implement an index on sorted data and can perform this > seek efficiently using their index. So the use of the index needs to be > abstracted in such a way that when the loader supports indexing, pig uses it > (indirectly through the loader) and does not construct an index. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-953) Enable merge join in pig to work with loaders and store functions which can internally index sorted data
[ https://issues.apache.org/jira/browse/PIG-953?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12760785#action_12760785 ] Dmitriy V. Ryaboy commented on PIG-953: --- Pradeep, I think that the current PigSchema can extend or contain a ResourceSchema (probably the latter as Alan indicated that PigSchema does too many other things to be considered equivalent to a ResourceSchema). Agreed that this rework is not required to get this patch in; I'm ok with the patch as it stands as long as we remember to go back and fix the duplicated functionality later when the Load/Store redesign is implemented. > Enable merge join in pig to work with loaders and store functions which can > internally index sorted data > - > > Key: PIG-953 > URL: https://issues.apache.org/jira/browse/PIG-953 > Project: Pig > Issue Type: Improvement >Affects Versions: 0.3.0 >Reporter: Pradeep Kamath >Assignee: Pradeep Kamath > Attachments: PIG-953-2.patch, PIG-953.patch > > > Currently merge join implementation in pig includes construction of an index > on sorted data and use of that index to seek into the "right input" to > efficiently perform the join operation. Some loaders (notably the zebra > loader) internally implement an index on sorted data and can perform this > seek efficiently using their index. So the use of the index needs to be > abstracted in such a way that when the loader supports indexing, pig uses it > (indirectly through the loader) and does not construct an index. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-953) Enable merge join in pig to work with loaders and store functions which can internally index sorted data
[ https://issues.apache.org/jira/browse/PIG-953?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12760710#action_12760710 ] Pradeep Kamath commented on PIG-953: Dmitriy, I looked at the ResourceSchema proposed in http://wiki.apache.org/pig/LoadStoreRedesignProposal and also spoke with Alan to understand the intent more. The eventual goal is for the setSchema() call in StoreFunc to give the ResourceSchema to the store implementation. The ResourceSchema will contain both pig schema information and sort column information. So Zebra or any other storage function which needs to know about sort columns will get the information from the ResourceSchema passed in setSchema(). However, today there is a way pig runtime conveys the pig schema to store functions (through StoreConfig). We need a separate way to give sort information since pig schema does not have the ability to give it. Since after the rewrite of load/store interfaces this problem will be solved through setSchema(), the solution which we will come up with now in this jira will anyway need to be re-written. So it is cleaner to only keep sort column information in SortColInfo and have an array of SortColInfo in SortInfo. If instead we use ResourceSchema then StoreConfig will have a pig Schema and a Resource Schema which would also be confusing to callers. In short, since this piece code of code will need a re-write later, it is better not to make it generic now and just address immediate needs and the re-write should remove multiple representations of schema/sort information. > Enable merge join in pig to work with loaders and store functions which can > internally index sorted data > - > > Key: PIG-953 > URL: https://issues.apache.org/jira/browse/PIG-953 > Project: Pig > Issue Type: Improvement >Affects Versions: 0.3.0 >Reporter: Pradeep Kamath >Assignee: Pradeep Kamath > Attachments: PIG-953-2.patch, PIG-953.patch > > > Currently merge join implementation in pig includes construction of an index > on sorted data and use of that index to seek into the "right input" to > efficiently perform the join operation. Some loaders (notably the zebra > loader) internally implement an index on sorted data and can perform this > seek efficiently using their index. So the use of the index needs to be > abstracted in such a way that when the loader supports indexing, pig uses it > (indirectly through the loader) and does not construct an index. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-953) Enable merge join in pig to work with loaders and store functions which can internally index sorted data
[ https://issues.apache.org/jira/browse/PIG-953?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12760657#action_12760657 ] Dmitriy V. Ryaboy commented on PIG-953: --- Pradeep, Have you looked at Alan's Load/Store redesign proposal? It has a structure very similar to the one you are describing, used to describe the schema of the loaded resource. I think it makes sense to use that structure throughout, not just at Load time; it could effectively replace SortColumn/SortInfo. > Enable merge join in pig to work with loaders and store functions which can > internally index sorted data > - > > Key: PIG-953 > URL: https://issues.apache.org/jira/browse/PIG-953 > Project: Pig > Issue Type: Improvement >Affects Versions: 0.3.0 >Reporter: Pradeep Kamath >Assignee: Pradeep Kamath > Attachments: PIG-953-2.patch, PIG-953.patch > > > Currently merge join implementation in pig includes construction of an index > on sorted data and use of that index to seek into the "right input" to > efficiently perform the join operation. Some loaders (notably the zebra > loader) internally implement an index on sorted data and can perform this > seek efficiently using their index. So the use of the index needs to be > abstracted in such a way that when the loader supports indexing, pig uses it > (indirectly through the loader) and does not construct an index. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-953) Enable merge join in pig to work with loaders and store functions which can internally index sorted data
[ https://issues.apache.org/jira/browse/PIG-953?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12760423#action_12760423 ] Pradeep Kamath commented on PIG-953: Here is a proposal for dealing with Sort Column information in SortInfo. Rather than giving Arraylist of column names and separate array list of asc/desc flags, it would be good to have a unified structure containing both pieces of information per sort column. Also there are use cases for providing column names (zebra) and for them being optional and providing column positions instead which some other loader /optimizer might find useful. The type of the column might also be useful if available. Hence, the proposal is to have a SortColumn class with the following attributes : column name, column position (zero based index), column type, asc/desc flag. Then in SortInfo there would be a List which would be available through a getter. This should address both the concerns above. Callers will need to explicity check for null column names and UNKNOWN column type since these two scenarios may occur if schema is not available for pig runtime to provide the information. Thoughts? > Enable merge join in pig to work with loaders and store functions which can > internally index sorted data > - > > Key: PIG-953 > URL: https://issues.apache.org/jira/browse/PIG-953 > Project: Pig > Issue Type: Improvement >Affects Versions: 0.3.0 >Reporter: Pradeep Kamath >Assignee: Pradeep Kamath > Attachments: PIG-953-2.patch, PIG-953.patch > > > Currently merge join implementation in pig includes construction of an index > on sorted data and use of that index to seek into the "right input" to > efficiently perform the join operation. Some loaders (notably the zebra > loader) internally implement an index on sorted data and can perform this > seek efficiently using their index. So the use of the index needs to be > abstracted in such a way that when the loader supports indexing, pig uses it > (indirectly through the loader) and does not construct an index. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-953) Enable merge join in pig to work with loaders and store functions which can internally index sorted data
[ https://issues.apache.org/jira/browse/PIG-953?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12760116#action_12760116 ] Ashutosh Chauhan commented on PIG-953: -- Changes look good. Couple of points: bq. I think this internal structure at this point does not need to be optimized for lookup Well, its less about optimization and more about maintainability. First the relationship between two parallel arrays is implicit. So, if someone is reading that code he needs to "understand" that relationship of his own. If there is only one structure relationship would be explicit. Second, there is quite a bit of code around it, which IMO will be simplified if a single data structure is instead used. That said, either approach works just as fine so I will leave it upto you. bq. Zebra needs column names and cannot work with positions That is then the limitation of Zebra which it should overcome someone point in time. There might be a good reason for it, but I fail to see what extra information names of column provides where type and position of columns should be sufficient. This also implies an additional requirement on user. If data is stored using ZebraStorage and if later is loaded back, then user has to provide the same names for columns that he gave while storing it. No such constraint exists for any other load-store like PigStorage. > Enable merge join in pig to work with loaders and store functions which can > internally index sorted data > - > > Key: PIG-953 > URL: https://issues.apache.org/jira/browse/PIG-953 > Project: Pig > Issue Type: Improvement >Affects Versions: 0.3.0 >Reporter: Pradeep Kamath >Assignee: Pradeep Kamath > Attachments: PIG-953-2.patch, PIG-953.patch > > > Currently merge join implementation in pig includes construction of an index > on sorted data and use of that index to seek into the "right input" to > efficiently perform the join operation. Some loaders (notably the zebra > loader) internally implement an index on sorted data and can perform this > seek efficiently using their index. So the use of the index needs to be > abstracted in such a way that when the loader supports indexing, pig uses it > (indirectly through the loader) and does not construct an index. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-953) Enable merge join in pig to work with loaders and store functions which can internally index sorted data
[ https://issues.apache.org/jira/browse/PIG-953?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12754811#action_12754811 ] Ashutosh Chauhan commented on PIG-953: -- And couple more: 8. bq. Findbugs complains about passing internal members as is in getters since the caller can then modifiy these internal members - hence the copy. {code} public List getAscColumns() { return Utils.getCopy(ascColumns); } {code} Instead if we use following, we will achieve the same thing and then neither findbugs will complain, nor their is need for our own copy method. {code} public List getAscColumns() { return new ArrayList(ascColumns); } {code} 9. In POMergeJoin.java {code} // we should never get here! return new Result(POStatus.STATUS_ERR, null); {code} could be changed to {code} // we should never get here! throw new ExecException(errMsg,2176); {code} because if we ever get there, it will result in NPE later on otherwise. > Enable merge join in pig to work with loaders and store functions which can > internally index sorted data > - > > Key: PIG-953 > URL: https://issues.apache.org/jira/browse/PIG-953 > Project: Pig > Issue Type: Improvement >Affects Versions: 0.3.0 >Reporter: Pradeep Kamath >Assignee: Pradeep Kamath > Attachments: PIG-953.patch > > > Currently merge join implementation in pig includes construction of an index > on sorted data and use of that index to seek into the "right input" to > efficiently perform the join operation. Some loaders (notably the zebra > loader) internally implement an index on sorted data and can perform this > seek efficiently using their index. So the use of the index needs to be > abstracted in such a way that when the loader supports indexing, pig uses it > (indirectly through the loader) and does not construct an index. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-953) Enable merge join in pig to work with loaders and store functions which can internally index sorted data
[ https://issues.apache.org/jira/browse/PIG-953?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12754671#action_12754671 ] Ashutosh Chauhan commented on PIG-953: -- 1. [Pradeep] zebra store function would basically needs to know the sort keys in order and which of them are asc/dsc. For this they would iterate over our data structure and require that the ordering of the keys match the primary/secondary order of the sort keys [Ashutosh] What about LinkedHashMap? It provides all the properties we are seeking here, one data structure, O(1) lookup and guaranteed iteration order. 2. In Utils.java {code} public static boolean checkNullAndClass(Object obj1, Object obj2) { return checkNullEquals(obj1, obj2, false) && obj1.getClass() == obj2.getClass(); } {code} will result in NPE when both obj1 and obj2 are null. A minor detail: Suppose obj1 is declared of type ArrayList and obj2 is declared of type ArrayList, obj1.getClass() == obj2.getClass() will return true thanks to type erasure by java compiler at compile time. Not sure if thats OK or not for the check here. 3. In StoreConfig.java One of the scenarios in which SortInfo is returned as null is {code} * 3) the store follows an "order by" but the schema * of "order by" does not have column name(s) for the sort * column(s) {code} I understand that reason for this additional constraint is because SortInfo maintains list of column names. But even if schema contains only type information and not the column names, that still is a sufficient information to build indexes. Information about on which column data is sorted on can be recorded using column positions isn't it? Does zebra requires columns to be named? If it doesn't then SortInfo could be changed in such a way that it can provide column position instead of names to loader, if columns arent named. In POMergeJoin.java 4. {code} +currentFileName = lFile.getFileName(); +loader = (LoadFunc)PigContext.instantiateFuncFromSpec(lFile.getFuncSpec()); +is = FileLocalizer.open(currentFileName, offset, pc); +if (currentFileName.endsWith(".bz") || currentFileName.endsWith(".bz2")) { +is = new CBZip2InputStream((SeekableInputStream)is, 9); +} else if (currentFileName.endsWith(".gz")) { +is = new GZIPInputStream(is); +} + {code} Isnt this blocked on https://issues.apache.org/jira/browse/PIG-930 ? 5. {code} default: // We don't deal with ERR/NULL. just pass them down return res; {code} should be changed to {code} default: throwProcessingException(false,null); {code} because if status is Error, execution should be stopped and exception should be thrown as early as possible instead of continue doing work which will be wasted. If status is Null NPE will occur while doing join. 6. {code} InputStream is = FileLocalizer.open(rightInputFileName, pc); rightLoader.bindTo(rightInputFileName, new BufferedPositionedInputStream(is), 0, Long.MAX_VALUE); {code} I dont see any use of this code. I think its not required and can be removed. Infact, there is no need of following function too: {code} /** * @param rightInputFileName the rightInputFileName to set */ public void setRightInputFileName(String rightInputFileName) { this.rightInputFileName = rightInputFileName; } {code} file name of right side is obtained from index which is contained in index file. Index file is directly passed as a constructor argument of indexableLoadFunc, so there is no need of passing rightinputfilename from MRCompiler to POMergeJoin. And if this reasoning is correct then DefaultIndexableLoader.bindTo() should throw an IOException, because contract on DefaultIndexableLoader is that it is initialized with all the info it needs in constructor and then seekNear is called on it to seek to correct location. bindTo() shouldn't be used for this loader. Also, seekNear() doesn't sound right. How about seekToClosest() ? 7. I think introducing order preserving flag on logical operator is a good idea. First its self documenting as the information is contained within operator and not checked by doing instanceof else where in code. Second its a useful information which if present can help make optimizer smart decisions. As an example, optimizer can rewrite a symmetric hash join to merge-sort join if all the logical operators in query DAG from join inputs to the root has these flags set to true. Without this flag, doing such optimizations will be hard. > Enable merge join in pig to work with loaders and store functions which can > internally index sorted data > - > > Key: PIG-953 > URL: https://issues.apache.org/jira/browse/PIG-953 > Project: Pig >
[jira] Commented: (PIG-953) Enable merge join in pig to work with loaders and store functions which can internally index sorted data
[ https://issues.apache.org/jira/browse/PIG-953?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12753850#action_12753850 ] Olga Natkovich commented on PIG-953: +1 on what Alan said > Enable merge join in pig to work with loaders and store functions which can > internally index sorted data > - > > Key: PIG-953 > URL: https://issues.apache.org/jira/browse/PIG-953 > Project: Pig > Issue Type: Improvement >Affects Versions: 0.3.0 >Reporter: Pradeep Kamath >Assignee: Pradeep Kamath > Attachments: PIG-953.patch > > > Currently merge join implementation in pig includes construction of an index > on sorted data and use of that index to seek into the "right input" to > efficiently perform the join operation. Some loaders (notably the zebra > loader) internally implement an index on sorted data and can perform this > seek efficiently using their index. So the use of the index needs to be > abstracted in such a way that when the loader supports indexing, pig uses it > (indirectly through the loader) and does not construct an index. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-953) Enable merge join in pig to work with loaders and store functions which can internally index sorted data
[ https://issues.apache.org/jira/browse/PIG-953?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12753843#action_12753843 ] Alan Gates commented on PIG-953: -1 to adding an orderPreserving flag on operators. We have no intention of ever promising that any relational operator beyond Order and Limit preserve order. The fact that some happen to now (like filter) is a side effect of the current implementation, not a feature. If we add a flag, it becomes a feature that we will be expected to maintain. > Enable merge join in pig to work with loaders and store functions which can > internally index sorted data > - > > Key: PIG-953 > URL: https://issues.apache.org/jira/browse/PIG-953 > Project: Pig > Issue Type: Improvement >Affects Versions: 0.3.0 >Reporter: Pradeep Kamath >Assignee: Pradeep Kamath > Attachments: PIG-953.patch > > > Currently merge join implementation in pig includes construction of an index > on sorted data and use of that index to seek into the "right input" to > efficiently perform the join operation. Some loaders (notably the zebra > loader) internally implement an index on sorted data and can perform this > seek efficiently using their index. So the use of the index needs to be > abstracted in such a way that when the loader supports indexing, pig uses it > (indirectly through the loader) and does not construct an index. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-953) Enable merge join in pig to work with loaders and store functions which can internally index sorted data
[ https://issues.apache.org/jira/browse/PIG-953?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12753833#action_12753833 ] Dmitriy V. Ryaboy commented on PIG-953: --- I got my trues and falses reversed on the NPE thing. You are right, the function works as intended. I still think it's too verbose, but agree that it's a style issue -- I guess if the commiters like it, it's fine :-) > Enable merge join in pig to work with loaders and store functions which can > internally index sorted data > - > > Key: PIG-953 > URL: https://issues.apache.org/jira/browse/PIG-953 > Project: Pig > Issue Type: Improvement >Affects Versions: 0.3.0 >Reporter: Pradeep Kamath >Assignee: Pradeep Kamath > Attachments: PIG-953.patch > > > Currently merge join implementation in pig includes construction of an index > on sorted data and use of that index to seek into the "right input" to > efficiently perform the join operation. Some loaders (notably the zebra > loader) internally implement an index on sorted data and can perform this > seek efficiently using their index. So the use of the index needs to be > abstracted in such a way that when the loader supports indexing, pig uses it > (indirectly through the loader) and does not construct an index. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-953) Enable merge join in pig to work with loaders and store functions which can internally index sorted data
[ https://issues.apache.org/jira/browse/PIG-953?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12753828#action_12753828 ] Dmitriy V. Ryaboy commented on PIG-953: --- bq. Pradeep: Pig only guarantees order with limit following order - for any other relational operator following order there are no guarantees. Today it is true that filter or a column pruning foreach would also preserve order but this can change if needed in the future. There explicit code to ensure order-limit combination works by preserving order - there is no such explicit check for other operators (keeping it open for change in the future) That actually tells me that an orderPreserving property on a LogicalOperator is a really good idea. That way we can set it to true on all operators that are at the moment order-preserving (limit, filter, column-prining foreach), and not commit to forever maintaining that contract. If filter starts changing order, the patch will simply have to include a change to set orderPreserving to false in POFilter, and everything will work automagically. > Enable merge join in pig to work with loaders and store functions which can > internally index sorted data > - > > Key: PIG-953 > URL: https://issues.apache.org/jira/browse/PIG-953 > Project: Pig > Issue Type: Improvement >Affects Versions: 0.3.0 >Reporter: Pradeep Kamath >Assignee: Pradeep Kamath > Attachments: PIG-953.patch > > > Currently merge join implementation in pig includes construction of an index > on sorted data and use of that index to seek into the "right input" to > efficiently perform the join operation. Some loaders (notably the zebra > loader) internally implement an index on sorted data and can perform this > seek efficiently using their index. So the use of the index needs to be > abstracted in such a way that when the loader supports indexing, pig uses it > (indirectly through the loader) and does not construct an index. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-953) Enable merge join in pig to work with loaders and store functions which can internally index sorted data
[ https://issues.apache.org/jira/browse/PIG-953?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12753827#action_12753827 ] Pradeep Kamath commented on PIG-953: I see the first part as a coding style preference - I have done both styles in code myself - don't think it is a major issue with readability with current implementation Can you explain the NPE? If either object is null, the code would return with false unless both are null. If checkEquality is false, the caller should know that only null equality has been checked thus far and if true was returned then the two objects are null and hence equal. My main intent was to have this helper function be used from other Class's equals() implementation so that this mundane check for null need not be repeated in every equals implementation. Maybe I am not understanding your use case better - an example might help. > Enable merge join in pig to work with loaders and store functions which can > internally index sorted data > - > > Key: PIG-953 > URL: https://issues.apache.org/jira/browse/PIG-953 > Project: Pig > Issue Type: Improvement >Affects Versions: 0.3.0 >Reporter: Pradeep Kamath >Assignee: Pradeep Kamath > Attachments: PIG-953.patch > > > Currently merge join implementation in pig includes construction of an index > on sorted data and use of that index to seek into the "right input" to > efficiently perform the join operation. Some loaders (notably the zebra > loader) internally implement an index on sorted data and can perform this > seek efficiently using their index. So the use of the index needs to be > abstracted in such a way that when the loader supports indexing, pig uses it > (indirectly through the loader) and does not construct an index. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-953) Enable merge join in pig to work with loaders and store functions which can internally index sorted data
[ https://issues.apache.org/jira/browse/PIG-953?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12753823#action_12753823 ] Dmitriy V. Ryaboy commented on PIG-953: --- Pradeep -- I would argue that my rewrite is not more terse, but less verbose :-) I just don't see how {code} if (boolean) { return true; } else { return false; } {code} is ever better or more readable than {code} return boolean; {code} As far as oddness, I would argue that since the person who wrote the function introduced a possible NPE a few lines lower, someone else can hardly be expected to use it properly :-). See my comment about checkEquality being false. The function as written misses a corner case. > Enable merge join in pig to work with loaders and store functions which can > internally index sorted data > - > > Key: PIG-953 > URL: https://issues.apache.org/jira/browse/PIG-953 > Project: Pig > Issue Type: Improvement >Affects Versions: 0.3.0 >Reporter: Pradeep Kamath >Assignee: Pradeep Kamath > Attachments: PIG-953.patch > > > Currently merge join implementation in pig includes construction of an index > on sorted data and use of that index to seek into the "right input" to > efficiently perform the join operation. Some loaders (notably the zebra > loader) internally implement an index on sorted data and can perform this > seek efficiently using their index. So the use of the index needs to be > abstracted in such a way that when the loader supports indexing, pig uses it > (indirectly through the loader) and does not construct an index. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-953) Enable merge join in pig to work with loaders and store functions which can internally index sorted data
[ https://issues.apache.org/jira/browse/PIG-953?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12753816#action_12753816 ] Pradeep Kamath commented on PIG-953: Response to previous comment: Dmitriy: seekNear seems ambiguous, as "near" is a generic concept that does not necessarily imply "before or to, but not after" - which is what this method is required to do. How about "seekBefore()"? Pradeep: I had initially thought of naming this method seekBefore(). However for the case where the key we are using to seek not being present in the right input, the loader can either position at the biggest value before the key OR the smallest value after the key (zebra loader for example can only do the latter). The name "seekBefore" suggests that implementations should always seek before the key in question - hence I chose seekNear. Dmitriy: it looks like the large diff blocks in MergeSort et al are mostly moves of code blocks, not significant code changes, correct? Pradeep: I am assuming you meant POMergeJoin - yes - its mostly code move to DefaultIndexableLoader Dmitriy: Why does getAscColumns and getSortColumns make a copy of the list? Seems like we can save some memory and cpu here. Pradeep: Findbugs complains about passing internal members as is in getters since the caller can then modifiy these internal members - hence the copy. This should not be a performance/memory issue since these copies are 1) on small structures 2) in front end at compile time and not at runtime. Dmitriry: For that matter, why not use a map of (String)colName-> (Boolean)ascending instead of 2 lists? One structure, plus O(1) lookup. Pradeep: This suggestion is reasonable - I picked the current implementation since its inline with how these things are represented today in LOSort. Unless the user of SortInfo does lookup using sort column names, we won't get O(1). I designed this keeping zebra's use case (which is the only use case at this point) and the zebra store function would basically needs to know the sort keys in order and which of them are asc/dsc. For this they would iterate over our datastructure and require that the ordering of the keys match the primary/secondary order of the sort keys - hence a list lends itself better for that. I debated using a List> but thought I can avoid Pair since its a pig implementation class (if tomorrow zebra wants to expose the same SortInfo to its users). Dmitriy: Not sure about the use of super() in the constructor of a class that doesn't extend anything but Object. Is there some magic that requires it? Pradeep: This was unintended - all thanks to eclipse :) - will remove it a next iteration of this patch Dmitriy: In Log2PhysTranslator, why hardcode the Limit operator? There are other operators that don't change sort order, such as filter. Perhaps add a method to Logical Operators that indicates if they alter sort order of their inputs? Pradeep: Pig only guarantees order with limit following order - for any other relational operator following order there are no guarantees. Today it is true that filter or a column pruning foreach would also preserve order but this can change if needed in the future. There explicit code to ensure order-limit combination works by preserving order - there is no such explicit check for other operators (keeping it open for change in the future) Dmitriy: Even with this rewrite, this seems like an odd function. It being as odd as it is leads to it not being used safely when you set checkEquality to false (just a few lines later)-- if obj1 is null and obj2 is not, the func returns true, you try to call a method on obj1, and get an NPE. Pradeep: The rewrite is more terse and is another option as against the explicit if-else I have - more a coding style issue than correctness. The idea of having this function was to serve as a helper for use in Classes which need to implement equals(). It is cumbersone that every new class's equals has to do the equivalent of what you suggest: {code} Util.bothNull(obj1, obj2) || (Util.notNull(obj1, obj2) && obj1.equals(obj2)); {code} This one method can be used either to only check that both objects are not null OR to do that and additionally check equality. Not sure I understand the oddity - am I missing something? Dmitriy: This comment has a typo (and instead of "an"): Pradeep: Will fix in next iteration > Enable merge join in pig to work with loaders and store functions which can > internally index sorted data > - > > Key: PIG-953 > URL: https://issues.apache.org/jira/browse/PIG-953 > Project: Pig > Issue Type: Improvement >Affects Versions: 0.3.0 >Reporter: Pradeep Kamath >Assignee: Pradeep Kamath > Atta
[jira] Commented: (PIG-953) Enable merge join in pig to work with loaders and store functions which can internally index sorted data
[ https://issues.apache.org/jira/browse/PIG-953?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12753801#action_12753801 ] Dmitriy V. Ryaboy commented on PIG-953: --- Pradeep, First, I think this is very important to have, not just for Merge but for other things that might benefit from knowing sort orders, as well. A few minor nits from a cursory glance at the code. I didn't check the actual logic very carefully yet -- it looks like the large diff blocks in MergeSort et al are mostly moves of code blocks, not significant code changes, correct? On to the comments: seekNear seems ambiguous, as "near" is a generic concept that does not necessarily imply "before or to, but not after" -- which is what this method is required to do. How about "seekBefore()"? Why does getAscColumns and getSortColumns make a copy of the list? Seems like we can save some memory and cpu here. For that matter, why not use a map of (String)colName-> (Boolean)ascending instead of 2 lists? One structure, plus O(1) lookup. Not sure about the use of super() in the constructor of a class that doesn't extend anything but Object. Is there some magic that requires it? In Log2PhysTranslator, why hardcode the Limit operator? There are other operators that don't change sort order, such as filter. Perhaps add a method to Logical Operators that indicates if they alter sort order of their inputs? in Utils checkNullEquals is better written as {code} if (obj1 == null || obj2 == null) { return obj1 == obj2; } else { return checkEquality ? obj1.equals(obj2) : true; } {code} Even with this rewrite, this seems like an odd function. It being as odd as it is leads to it not being used safely when you set checkEquality to false (just a few lines later)-- if obj1 is null and obj2 is not, the func returns true, you try to call a method on obj1, and get an NPE. Probably better not to roll all this into one amorphous function and simply write {code} Util.bothNull(obj1, obj2) || (Util.notNull(obj1, obj2) && obj1.equals(obj2)); {code} (the implementations of bothNull and notNull are obvious -- just conjunction and disjunction of obj == null) In StoreConfig This comment has a typo (and instead of "an"): "* 1) the store does not follow and order by" > Enable merge join in pig to work with loaders and store functions which can > internally index sorted data > - > > Key: PIG-953 > URL: https://issues.apache.org/jira/browse/PIG-953 > Project: Pig > Issue Type: Improvement >Affects Versions: 0.3.0 >Reporter: Pradeep Kamath >Assignee: Pradeep Kamath > Attachments: PIG-953.patch > > > Currently merge join implementation in pig includes construction of an index > on sorted data and use of that index to seek into the "right input" to > efficiently perform the join operation. Some loaders (notably the zebra > loader) internally implement an index on sorted data and can perform this > seek efficiently using their index. So the use of the index needs to be > abstracted in such a way that when the loader supports indexing, pig uses it > (indirectly through the loader) and does not construct an index. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.