[jira] Commented: (PIG-953) Enable merge join in pig to work with loaders and store functions which can internally index sorted data

2009-10-28 Thread Olga Natkovich (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-953?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12771188#action_12771188
 ] 

Olga Natkovich commented on PIG-953:


+1 on the patch assuming that the all the failures and warning from test-patch 
are addressed.

> Enable merge join in pig to work with loaders and store functions which can 
> internally index sorted data 
> -
>
> Key: PIG-953
> URL: https://issues.apache.org/jira/browse/PIG-953
> Project: Pig
>  Issue Type: Improvement
>Affects Versions: 0.3.0
>Reporter: Pradeep Kamath
>Assignee: Pradeep Kamath
> Attachments: PIG-953-2.patch, PIG-953-3.patch, PIG-953-4.patch, 
> PIG-953-5.patch, PIG-953-6.patch, PIG-953-7.patch, PIG-953-8.patch, 
> PIG-953.patch
>
>
> Currently merge join implementation in pig includes construction of an index 
> on sorted data and use of that index to seek into the "right input" to 
> efficiently perform the join operation. Some loaders (notably the zebra 
> loader) internally implement an index on sorted data and can perform this 
> seek efficiently using their index. So the use of the index needs to be 
> abstracted in such a way that when the loader supports indexing, pig uses it 
> (indirectly through the loader) and does not construct an index. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-953) Enable merge join in pig to work with loaders and store functions which can internally index sorted data

2009-10-28 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-953?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12771136#action_12771136
 ] 

Hadoop QA commented on PIG-953:
---

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12423385/PIG-953-8.patch
  against trunk revision 830664.

+1 @author.  The patch does not contain any @author tags.

+1 tests included.  The patch appears to include 6 new or modified tests.

-1 javadoc.  The javadoc tool appears to have generated 1 warning messages.

-1 javac.  The applied patch generated 202 javac compiler warnings (more 
than the trunk's current 197 warnings).

-1 findbugs.  The patch appears to introduce 5 new Findbugs warnings.

-1 release audit.  The applied patch generated 320 release audit warnings 
(more than the trunk's current 313 warnings).

-1 core tests.  The patch failed core unit tests.

+1 contrib tests.  The patch passed contrib unit tests.

Test results: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/123/testReport/
Release audit warnings: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/123/artifact/trunk/patchprocess/releaseAuditDiffWarnings.txt
Findbugs warnings: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/123/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/123/console

This message is automatically generated.

> Enable merge join in pig to work with loaders and store functions which can 
> internally index sorted data 
> -
>
> Key: PIG-953
> URL: https://issues.apache.org/jira/browse/PIG-953
> Project: Pig
>  Issue Type: Improvement
>Affects Versions: 0.3.0
>Reporter: Pradeep Kamath
>Assignee: Pradeep Kamath
> Attachments: PIG-953-2.patch, PIG-953-3.patch, PIG-953-4.patch, 
> PIG-953-5.patch, PIG-953-6.patch, PIG-953-7.patch, PIG-953-8.patch, 
> PIG-953.patch
>
>
> Currently merge join implementation in pig includes construction of an index 
> on sorted data and use of that index to seek into the "right input" to 
> efficiently perform the join operation. Some loaders (notably the zebra 
> loader) internally implement an index on sorted data and can perform this 
> seek efficiently using their index. So the use of the index needs to be 
> abstracted in such a way that when the loader supports indexing, pig uses it 
> (indirectly through the loader) and does not construct an index. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-953) Enable merge join in pig to work with loaders and store functions which can internally index sorted data

2009-10-16 Thread Dmitriy V. Ryaboy (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-953?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12766622#action_12766622
 ] 

Dmitriy V. Ryaboy commented on PIG-953:
---

Pradeep, it seems like PigOutputCommiter should extend OutputCommitter rather 
than FileOutputCommitter. 
Also -- add this requirement to the StoreFunc redesign proposal?



> Enable merge join in pig to work with loaders and store functions which can 
> internally index sorted data 
> -
>
> Key: PIG-953
> URL: https://issues.apache.org/jira/browse/PIG-953
> Project: Pig
>  Issue Type: Improvement
>Affects Versions: 0.3.0
>Reporter: Pradeep Kamath
>Assignee: Pradeep Kamath
> Attachments: PIG-953-2.patch, PIG-953-3.patch, PIG-953-4.patch, 
> PIG-953-5.patch, PIG-953.patch
>
>
> Currently merge join implementation in pig includes construction of an index 
> on sorted data and use of that index to seek into the "right input" to 
> efficiently perform the join operation. Some loaders (notably the zebra 
> loader) internally implement an index on sorted data and can perform this 
> seek efficiently using their index. So the use of the index needs to be 
> abstracted in such a way that when the loader supports indexing, pig uses it 
> (indirectly through the loader) and does not construct an index. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-953) Enable merge join in pig to work with loaders and store functions which can internally index sorted data

2009-10-07 Thread Ashutosh Chauhan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-953?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12763256#action_12763256
 ] 

Ashutosh Chauhan commented on PIG-953:
--

..aah.. I should have had dug more in jdk sources. AbstractList , which 
ArrayList extends does override equals and provides correct behavior. So, my 
comment is a non-issue. With nits taken care of +1 for the patch.

> Enable merge join in pig to work with loaders and store functions which can 
> internally index sorted data 
> -
>
> Key: PIG-953
> URL: https://issues.apache.org/jira/browse/PIG-953
> Project: Pig
>  Issue Type: Improvement
>Affects Versions: 0.3.0
>Reporter: Pradeep Kamath
>Assignee: Pradeep Kamath
> Attachments: PIG-953-2.patch, PIG-953-3.patch, PIG-953.patch
>
>
> Currently merge join implementation in pig includes construction of an index 
> on sorted data and use of that index to seek into the "right input" to 
> efficiently perform the join operation. Some loaders (notably the zebra 
> loader) internally implement an index on sorted data and can perform this 
> seek efficiently using their index. So the use of the index needs to be 
> abstracted in such a way that when the loader supports indexing, pig uses it 
> (indirectly through the loader) and does not construct an index. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-953) Enable merge join in pig to work with loaders and store functions which can internally index sorted data

2009-10-07 Thread Ashutosh Chauhan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-953?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12763218#action_12763218
 ] 

Ashutosh Chauhan commented on PIG-953:
--

Changes look good. One comment I have:

1) In SortInfo.java#equals
We have two lists and we want to check for their equality. I quickly looked up 
jdk sources and it seems that ArrayList doesn't override equals, so doing 
equals check on lists would result in reference equality test which would be 
incorrect. Correct way to do this would be to first check the sizes of two 
lists, if they are equal iterate through both lists and check equality of items 
at the same index in two list.  

Few nits:
1) TestMergeJoin contains a System.err.println which we can get rid of.
2) There are few unused imports in patch.
3) SortInfo.java#getSortColInfoList may result in Findbugs warning because of 
similar reason we discussed earlier in this jira. 

> Enable merge join in pig to work with loaders and store functions which can 
> internally index sorted data 
> -
>
> Key: PIG-953
> URL: https://issues.apache.org/jira/browse/PIG-953
> Project: Pig
>  Issue Type: Improvement
>Affects Versions: 0.3.0
>Reporter: Pradeep Kamath
>Assignee: Pradeep Kamath
> Attachments: PIG-953-2.patch, PIG-953-3.patch, PIG-953.patch
>
>
> Currently merge join implementation in pig includes construction of an index 
> on sorted data and use of that index to seek into the "right input" to 
> efficiently perform the join operation. Some loaders (notably the zebra 
> loader) internally implement an index on sorted data and can perform this 
> seek efficiently using their index. So the use of the index needs to be 
> abstracted in such a way that when the loader supports indexing, pig uses it 
> (indirectly through the loader) and does not construct an index. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-953) Enable merge join in pig to work with loaders and store functions which can internally index sorted data

2009-09-29 Thread Ashutosh Chauhan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-953?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12760837#action_12760837
 ] 

Ashutosh Chauhan commented on PIG-953:
--

If we are going to re-write that part later then I guess it may not be worth 
spending more time perfecting the design here (using SortColInfo instead of 
parallel arraylists)... patch as it is can be committed because we are going to 
update that part of the code in any case .. 

> Enable merge join in pig to work with loaders and store functions which can 
> internally index sorted data 
> -
>
> Key: PIG-953
> URL: https://issues.apache.org/jira/browse/PIG-953
> Project: Pig
>  Issue Type: Improvement
>Affects Versions: 0.3.0
>Reporter: Pradeep Kamath
>Assignee: Pradeep Kamath
> Attachments: PIG-953-2.patch, PIG-953.patch
>
>
> Currently merge join implementation in pig includes construction of an index 
> on sorted data and use of that index to seek into the "right input" to 
> efficiently perform the join operation. Some loaders (notably the zebra 
> loader) internally implement an index on sorted data and can perform this 
> seek efficiently using their index. So the use of the index needs to be 
> abstracted in such a way that when the loader supports indexing, pig uses it 
> (indirectly through the loader) and does not construct an index. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-953) Enable merge join in pig to work with loaders and store functions which can internally index sorted data

2009-09-29 Thread Dmitriy V. Ryaboy (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-953?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12760785#action_12760785
 ] 

Dmitriy V. Ryaboy commented on PIG-953:
---

Pradeep,

I think that the current PigSchema can extend or contain a ResourceSchema 
(probably the latter as Alan indicated that PigSchema does too many other 
things to be considered equivalent to a ResourceSchema).  

Agreed that this rework is not required to get this patch in; I'm ok with the 
patch as it stands as long as we remember to go back and fix the duplicated 
functionality later when the Load/Store redesign is implemented.

> Enable merge join in pig to work with loaders and store functions which can 
> internally index sorted data 
> -
>
> Key: PIG-953
> URL: https://issues.apache.org/jira/browse/PIG-953
> Project: Pig
>  Issue Type: Improvement
>Affects Versions: 0.3.0
>Reporter: Pradeep Kamath
>Assignee: Pradeep Kamath
> Attachments: PIG-953-2.patch, PIG-953.patch
>
>
> Currently merge join implementation in pig includes construction of an index 
> on sorted data and use of that index to seek into the "right input" to 
> efficiently perform the join operation. Some loaders (notably the zebra 
> loader) internally implement an index on sorted data and can perform this 
> seek efficiently using their index. So the use of the index needs to be 
> abstracted in such a way that when the loader supports indexing, pig uses it 
> (indirectly through the loader) and does not construct an index. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-953) Enable merge join in pig to work with loaders and store functions which can internally index sorted data

2009-09-29 Thread Pradeep Kamath (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-953?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12760710#action_12760710
 ] 

Pradeep Kamath commented on PIG-953:


Dmitriy,
  I looked at the ResourceSchema proposed in 
http://wiki.apache.org/pig/LoadStoreRedesignProposal and also spoke with Alan 
to understand the intent more. The eventual goal is for the setSchema() call in 
StoreFunc to give the ResourceSchema to the store implementation. The 
ResourceSchema will contain both pig schema information and sort column 
information. So Zebra or any other storage function which needs to know about 
sort columns will get the information from the ResourceSchema passed in 
setSchema().

However, today there is a way pig runtime conveys the pig schema to store 
functions (through StoreConfig). We need a separate way to give sort 
information since pig schema does not have the ability to give it. Since after 
the rewrite of load/store interfaces this problem will be solved through 
setSchema(), the solution which we will come up with now in this jira will 
anyway need to be re-written. So it is cleaner to only keep sort column 
information in SortColInfo and have an array of SortColInfo in SortInfo. If 
instead we use ResourceSchema then StoreConfig will have a pig Schema and a 
Resource Schema which would also be confusing to callers. 

In short, since this piece code of code will need a re-write later, it is 
better not to make it generic now and just address immediate needs and the 
re-write should remove multiple representations of schema/sort information.

> Enable merge join in pig to work with loaders and store functions which can 
> internally index sorted data 
> -
>
> Key: PIG-953
> URL: https://issues.apache.org/jira/browse/PIG-953
> Project: Pig
>  Issue Type: Improvement
>Affects Versions: 0.3.0
>Reporter: Pradeep Kamath
>Assignee: Pradeep Kamath
> Attachments: PIG-953-2.patch, PIG-953.patch
>
>
> Currently merge join implementation in pig includes construction of an index 
> on sorted data and use of that index to seek into the "right input" to 
> efficiently perform the join operation. Some loaders (notably the zebra 
> loader) internally implement an index on sorted data and can perform this 
> seek efficiently using their index. So the use of the index needs to be 
> abstracted in such a way that when the loader supports indexing, pig uses it 
> (indirectly through the loader) and does not construct an index. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-953) Enable merge join in pig to work with loaders and store functions which can internally index sorted data

2009-09-29 Thread Dmitriy V. Ryaboy (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-953?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12760657#action_12760657
 ] 

Dmitriy V. Ryaboy commented on PIG-953:
---

Pradeep,
Have you looked at Alan's Load/Store redesign proposal? It has a structure very 
similar to the one you are describing, used to describe the schema of the 
loaded resource.
I think it makes sense to use that structure throughout, not just at Load time; 
it could effectively replace SortColumn/SortInfo.



> Enable merge join in pig to work with loaders and store functions which can 
> internally index sorted data 
> -
>
> Key: PIG-953
> URL: https://issues.apache.org/jira/browse/PIG-953
> Project: Pig
>  Issue Type: Improvement
>Affects Versions: 0.3.0
>Reporter: Pradeep Kamath
>Assignee: Pradeep Kamath
> Attachments: PIG-953-2.patch, PIG-953.patch
>
>
> Currently merge join implementation in pig includes construction of an index 
> on sorted data and use of that index to seek into the "right input" to 
> efficiently perform the join operation. Some loaders (notably the zebra 
> loader) internally implement an index on sorted data and can perform this 
> seek efficiently using their index. So the use of the index needs to be 
> abstracted in such a way that when the loader supports indexing, pig uses it 
> (indirectly through the loader) and does not construct an index. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-953) Enable merge join in pig to work with loaders and store functions which can internally index sorted data

2009-09-28 Thread Pradeep Kamath (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-953?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12760423#action_12760423
 ] 

Pradeep Kamath commented on PIG-953:


Here is a proposal for dealing with Sort Column information in SortInfo. Rather 
than giving Arraylist of column names and separate array list of asc/desc 
flags, it would be good to have a unified structure containing both pieces of 
information per sort column. Also there are use cases for providing column 
names (zebra) and for them being optional and providing column positions 
instead which some other loader /optimizer might find useful. The type of the 
column might also be useful if available. Hence, the proposal is to have a 
SortColumn class with the following attributes : column name, column position 
(zero based index), column type, asc/desc flag. Then in SortInfo there would be 
a List which would be available through a getter. This should 
address both the concerns above. Callers will need to explicity check for null 
column names and UNKNOWN column type since these two scenarios may occur if 
schema is not available for pig runtime to provide the information.

Thoughts?

> Enable merge join in pig to work with loaders and store functions which can 
> internally index sorted data 
> -
>
> Key: PIG-953
> URL: https://issues.apache.org/jira/browse/PIG-953
> Project: Pig
>  Issue Type: Improvement
>Affects Versions: 0.3.0
>Reporter: Pradeep Kamath
>Assignee: Pradeep Kamath
> Attachments: PIG-953-2.patch, PIG-953.patch
>
>
> Currently merge join implementation in pig includes construction of an index 
> on sorted data and use of that index to seek into the "right input" to 
> efficiently perform the join operation. Some loaders (notably the zebra 
> loader) internally implement an index on sorted data and can perform this 
> seek efficiently using their index. So the use of the index needs to be 
> abstracted in such a way that when the loader supports indexing, pig uses it 
> (indirectly through the loader) and does not construct an index. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-953) Enable merge join in pig to work with loaders and store functions which can internally index sorted data

2009-09-27 Thread Ashutosh Chauhan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-953?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12760116#action_12760116
 ] 

Ashutosh Chauhan commented on PIG-953:
--

Changes look good. Couple of points:

bq. I think this internal structure at this point does not need to be optimized 
for lookup

Well, its less about optimization and more about maintainability. First the 
relationship between two parallel arrays is implicit. So, if someone is reading 
that code he needs to "understand" that relationship of his own. If there is 
only one structure relationship would be explicit. Second, there is quite a bit 
of  code around it, which IMO will be simplified if a single data structure is 
instead used. That said, either approach works just as fine so I will leave it 
upto you. 

bq. Zebra needs column names and cannot work with positions

That is then the limitation of Zebra which it should overcome someone point in 
time. There might be a good reason for it, but I fail to see what extra 
information names of column provides where type and position of columns should 
be sufficient. This also implies an additional requirement on user. If data is 
stored using ZebraStorage and if later is loaded back, then user has to provide 
the same names for columns that he gave while storing it. No such constraint 
exists for any other load-store like PigStorage.

> Enable merge join in pig to work with loaders and store functions which can 
> internally index sorted data 
> -
>
> Key: PIG-953
> URL: https://issues.apache.org/jira/browse/PIG-953
> Project: Pig
>  Issue Type: Improvement
>Affects Versions: 0.3.0
>Reporter: Pradeep Kamath
>Assignee: Pradeep Kamath
> Attachments: PIG-953-2.patch, PIG-953.patch
>
>
> Currently merge join implementation in pig includes construction of an index 
> on sorted data and use of that index to seek into the "right input" to 
> efficiently perform the join operation. Some loaders (notably the zebra 
> loader) internally implement an index on sorted data and can perform this 
> seek efficiently using their index. So the use of the index needs to be 
> abstracted in such a way that when the loader supports indexing, pig uses it 
> (indirectly through the loader) and does not construct an index. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-953) Enable merge join in pig to work with loaders and store functions which can internally index sorted data

2009-09-13 Thread Ashutosh Chauhan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-953?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12754811#action_12754811
 ] 

Ashutosh Chauhan commented on PIG-953:
--

And couple more:
8.
bq. Findbugs complains about passing internal members as is in getters since 
the caller can then modifiy these internal members - hence the copy.

{code}
public List getAscColumns() {
return Utils.getCopy(ascColumns);
}
{code}

Instead if we use following, we will achieve the same thing and then neither 
findbugs will complain, nor their is need for our own copy method.
{code}
public List getAscColumns() {
return new ArrayList(ascColumns);
}
{code}

9. In POMergeJoin.java
{code}
// we should never get here!
return new Result(POStatus.STATUS_ERR, null);
{code}

could be changed to
{code}
// we should never get here!
throw new ExecException(errMsg,2176);
{code}
because if we ever get there, it will result in NPE later on otherwise.

> Enable merge join in pig to work with loaders and store functions which can 
> internally index sorted data 
> -
>
> Key: PIG-953
> URL: https://issues.apache.org/jira/browse/PIG-953
> Project: Pig
>  Issue Type: Improvement
>Affects Versions: 0.3.0
>Reporter: Pradeep Kamath
>Assignee: Pradeep Kamath
> Attachments: PIG-953.patch
>
>
> Currently merge join implementation in pig includes construction of an index 
> on sorted data and use of that index to seek into the "right input" to 
> efficiently perform the join operation. Some loaders (notably the zebra 
> loader) internally implement an index on sorted data and can perform this 
> seek efficiently using their index. So the use of the index needs to be 
> abstracted in such a way that when the loader supports indexing, pig uses it 
> (indirectly through the loader) and does not construct an index. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-953) Enable merge join in pig to work with loaders and store functions which can internally index sorted data

2009-09-13 Thread Ashutosh Chauhan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-953?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12754671#action_12754671
 ] 

Ashutosh Chauhan commented on PIG-953:
--

1. [Pradeep] zebra store function would basically needs to know the sort keys 
in order and which of them are asc/dsc. For this they would iterate over our 
data structure and require that the ordering of the keys match the 
primary/secondary order of the sort keys

[Ashutosh] What about LinkedHashMap? It provides all the properties we are 
seeking here, one data structure, O(1) lookup and guaranteed iteration order. 

2. In Utils.java
{code}
public static boolean checkNullAndClass(Object obj1, Object obj2) {
return checkNullEquals(obj1, obj2, false) && obj1.getClass() == 
obj2.getClass();
}
{code}

will result in NPE when both obj1 and obj2 are null. 

A minor detail:  Suppose obj1 is declared of type ArrayList and obj2 
is declared of type ArrayList, obj1.getClass() == obj2.getClass() will 
return true thanks to type erasure by java compiler at compile time. Not sure 
if thats OK or not for the check here. 

3. In StoreConfig.java One of the scenarios in which SortInfo is returned as 
null is
{code}
* 3) the store follows an "order by" but the schema
* of "order by" does not have column name(s) for the sort
* column(s)
{code}
I understand that reason for this additional constraint is because SortInfo 
maintains list of column names. But even if schema contains only type 
information and not the column names, that still is a sufficient information to 
build indexes. Information about on which column data is sorted on can be 
recorded using column positions isn't it? Does zebra requires columns to be 
named? If it doesn't then SortInfo could be changed in such a way that it can 
provide column position instead of names to loader, if columns arent named.

In POMergeJoin.java
4.
{code}
+currentFileName = lFile.getFileName();
+loader = 
(LoadFunc)PigContext.instantiateFuncFromSpec(lFile.getFuncSpec());
+is = FileLocalizer.open(currentFileName, offset, pc);
+if (currentFileName.endsWith(".bz") || 
currentFileName.endsWith(".bz2")) {
+is = new CBZip2InputStream((SeekableInputStream)is, 9);
+} else if (currentFileName.endsWith(".gz")) {
+is = new GZIPInputStream(is);
+}
+
{code}

Isnt this blocked on https://issues.apache.org/jira/browse/PIG-930 ?

5.
{code}
default: // We don't deal with ERR/NULL. just pass them down
return res;
{code}
 should be changed to   
{code}
default: 
throwProcessingException(false,null);
{code}

because if status is Error, execution should be stopped and exception should be 
thrown as early as possible instead of continue doing work which will be 
wasted. If status is Null NPE will occur while doing join.

6.
{code}
InputStream is = FileLocalizer.open(rightInputFileName, pc);
rightLoader.bindTo(rightInputFileName, new BufferedPositionedInputStream(is), 
0, Long.MAX_VALUE);
{code}
I dont see any use of this code. I think its not required and can be removed.

Infact, there is no need of following function too:
{code}
/**
 * @param rightInputFileName the rightInputFileName to set
 */
public void setRightInputFileName(String rightInputFileName) {
this.rightInputFileName = rightInputFileName;
}
{code} 
file name of right side is obtained from index which is contained in index 
file. Index file is directly passed as a constructor argument of 
indexableLoadFunc, so there is no need of passing rightinputfilename from 
MRCompiler to POMergeJoin.
And if this reasoning is correct then DefaultIndexableLoader.bindTo() should 
throw an IOException, because contract on DefaultIndexableLoader is that it is 
initialized with all the info it needs in constructor and then seekNear is 
called on it to seek to correct location. bindTo() shouldn't be used for this 
loader. 
Also, seekNear() doesn't sound right. How about seekToClosest() ?   

7. I think introducing order preserving flag on logical operator is a good 
idea. 
First its self documenting as the information is contained within operator and 
not checked by doing instanceof else where in code. 
Second its a useful information which if present can help make optimizer smart 
decisions. As an example, optimizer can rewrite a symmetric hash join to 
merge-sort join if all the logical operators in query DAG from join inputs to 
the root has these flags set to true. Without this flag, doing such 
optimizations will be hard.

> Enable merge join in pig to work with loaders and store functions which can 
> internally index sorted data 
> -
>
> Key: PIG-953
> URL: https://issues.apache.org/jira/browse/PIG-953
> Project: Pig
>  

[jira] Commented: (PIG-953) Enable merge join in pig to work with loaders and store functions which can internally index sorted data

2009-09-10 Thread Olga Natkovich (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-953?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12753850#action_12753850
 ] 

Olga Natkovich commented on PIG-953:


+1 on what Alan said

> Enable merge join in pig to work with loaders and store functions which can 
> internally index sorted data 
> -
>
> Key: PIG-953
> URL: https://issues.apache.org/jira/browse/PIG-953
> Project: Pig
>  Issue Type: Improvement
>Affects Versions: 0.3.0
>Reporter: Pradeep Kamath
>Assignee: Pradeep Kamath
> Attachments: PIG-953.patch
>
>
> Currently merge join implementation in pig includes construction of an index 
> on sorted data and use of that index to seek into the "right input" to 
> efficiently perform the join operation. Some loaders (notably the zebra 
> loader) internally implement an index on sorted data and can perform this 
> seek efficiently using their index. So the use of the index needs to be 
> abstracted in such a way that when the loader supports indexing, pig uses it 
> (indirectly through the loader) and does not construct an index. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-953) Enable merge join in pig to work with loaders and store functions which can internally index sorted data

2009-09-10 Thread Alan Gates (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-953?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12753843#action_12753843
 ] 

Alan Gates commented on PIG-953:


-1 to adding an orderPreserving flag on operators.  We have no intention of 
ever promising that any relational operator beyond Order and Limit preserve 
order.  The fact that some happen to now (like filter) is a side effect of the 
current implementation, not a feature.  If we add a flag, it becomes a feature 
that we will be expected to maintain.

> Enable merge join in pig to work with loaders and store functions which can 
> internally index sorted data 
> -
>
> Key: PIG-953
> URL: https://issues.apache.org/jira/browse/PIG-953
> Project: Pig
>  Issue Type: Improvement
>Affects Versions: 0.3.0
>Reporter: Pradeep Kamath
>Assignee: Pradeep Kamath
> Attachments: PIG-953.patch
>
>
> Currently merge join implementation in pig includes construction of an index 
> on sorted data and use of that index to seek into the "right input" to 
> efficiently perform the join operation. Some loaders (notably the zebra 
> loader) internally implement an index on sorted data and can perform this 
> seek efficiently using their index. So the use of the index needs to be 
> abstracted in such a way that when the loader supports indexing, pig uses it 
> (indirectly through the loader) and does not construct an index. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-953) Enable merge join in pig to work with loaders and store functions which can internally index sorted data

2009-09-10 Thread Dmitriy V. Ryaboy (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-953?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12753833#action_12753833
 ] 

Dmitriy V. Ryaboy commented on PIG-953:
---

I got my trues and falses reversed on the NPE thing. You are right, the 
function works as intended.
I still think it's too verbose, but agree that it's a style issue -- I guess if 
the commiters like it, it's fine :-)

> Enable merge join in pig to work with loaders and store functions which can 
> internally index sorted data 
> -
>
> Key: PIG-953
> URL: https://issues.apache.org/jira/browse/PIG-953
> Project: Pig
>  Issue Type: Improvement
>Affects Versions: 0.3.0
>Reporter: Pradeep Kamath
>Assignee: Pradeep Kamath
> Attachments: PIG-953.patch
>
>
> Currently merge join implementation in pig includes construction of an index 
> on sorted data and use of that index to seek into the "right input" to 
> efficiently perform the join operation. Some loaders (notably the zebra 
> loader) internally implement an index on sorted data and can perform this 
> seek efficiently using their index. So the use of the index needs to be 
> abstracted in such a way that when the loader supports indexing, pig uses it 
> (indirectly through the loader) and does not construct an index. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-953) Enable merge join in pig to work with loaders and store functions which can internally index sorted data

2009-09-10 Thread Dmitriy V. Ryaboy (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-953?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12753828#action_12753828
 ] 

Dmitriy V. Ryaboy commented on PIG-953:
---

bq. Pradeep: Pig only guarantees order with limit following order - for any 
other relational operator following order there are no guarantees. Today it is 
true that filter or a column pruning foreach would also preserve order but this 
can change if needed in the future. There explicit code to ensure order-limit 
combination works by preserving order - there is no such explicit check for 
other operators (keeping it open for change in the future)

That actually tells me that an orderPreserving property on a LogicalOperator is 
a really good idea.
That way we can set it to true on all operators that are at the moment 
order-preserving (limit, filter, column-prining foreach), and not commit to 
forever maintaining that contract. If filter starts changing order, the patch 
will simply have to include a change to set orderPreserving to false in 
POFilter, and everything will work automagically.

> Enable merge join in pig to work with loaders and store functions which can 
> internally index sorted data 
> -
>
> Key: PIG-953
> URL: https://issues.apache.org/jira/browse/PIG-953
> Project: Pig
>  Issue Type: Improvement
>Affects Versions: 0.3.0
>Reporter: Pradeep Kamath
>Assignee: Pradeep Kamath
> Attachments: PIG-953.patch
>
>
> Currently merge join implementation in pig includes construction of an index 
> on sorted data and use of that index to seek into the "right input" to 
> efficiently perform the join operation. Some loaders (notably the zebra 
> loader) internally implement an index on sorted data and can perform this 
> seek efficiently using their index. So the use of the index needs to be 
> abstracted in such a way that when the loader supports indexing, pig uses it 
> (indirectly through the loader) and does not construct an index. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-953) Enable merge join in pig to work with loaders and store functions which can internally index sorted data

2009-09-10 Thread Pradeep Kamath (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-953?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12753827#action_12753827
 ] 

Pradeep Kamath commented on PIG-953:


I see the first part as a coding style preference - I have done both styles in 
code myself - don't think it is a major issue with readability with current 
implementation

Can you explain the NPE? If either object is null, the code would return with 
false unless both are null. If checkEquality is false, the caller should know 
that only null equality has been checked thus far and if true was returned then 
the two objects are null and hence equal. My main intent was to have this 
helper function be used from other Class's equals() implementation so that this 
mundane check for null need not be repeated in every equals implementation. 
Maybe I am not understanding your use case better - an example might help.

> Enable merge join in pig to work with loaders and store functions which can 
> internally index sorted data 
> -
>
> Key: PIG-953
> URL: https://issues.apache.org/jira/browse/PIG-953
> Project: Pig
>  Issue Type: Improvement
>Affects Versions: 0.3.0
>Reporter: Pradeep Kamath
>Assignee: Pradeep Kamath
> Attachments: PIG-953.patch
>
>
> Currently merge join implementation in pig includes construction of an index 
> on sorted data and use of that index to seek into the "right input" to 
> efficiently perform the join operation. Some loaders (notably the zebra 
> loader) internally implement an index on sorted data and can perform this 
> seek efficiently using their index. So the use of the index needs to be 
> abstracted in such a way that when the loader supports indexing, pig uses it 
> (indirectly through the loader) and does not construct an index. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-953) Enable merge join in pig to work with loaders and store functions which can internally index sorted data

2009-09-10 Thread Dmitriy V. Ryaboy (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-953?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12753823#action_12753823
 ] 

Dmitriy V. Ryaboy commented on PIG-953:
---

Pradeep -- I would argue that my rewrite is not more terse, but less verbose 
:-)  
I just don't see how
{code}
if (boolean) {
  return true;
} else {
  return false;
}
{code}

is ever better or more readable than

{code}
return boolean;
{code}

As far as oddness, I would argue that since the person who wrote the function 
introduced a possible NPE a few lines lower, someone else can hardly be 
expected to use it properly :-).  See my comment about checkEquality being 
false.  The function as written misses a corner case.


> Enable merge join in pig to work with loaders and store functions which can 
> internally index sorted data 
> -
>
> Key: PIG-953
> URL: https://issues.apache.org/jira/browse/PIG-953
> Project: Pig
>  Issue Type: Improvement
>Affects Versions: 0.3.0
>Reporter: Pradeep Kamath
>Assignee: Pradeep Kamath
> Attachments: PIG-953.patch
>
>
> Currently merge join implementation in pig includes construction of an index 
> on sorted data and use of that index to seek into the "right input" to 
> efficiently perform the join operation. Some loaders (notably the zebra 
> loader) internally implement an index on sorted data and can perform this 
> seek efficiently using their index. So the use of the index needs to be 
> abstracted in such a way that when the loader supports indexing, pig uses it 
> (indirectly through the loader) and does not construct an index. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-953) Enable merge join in pig to work with loaders and store functions which can internally index sorted data

2009-09-10 Thread Pradeep Kamath (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-953?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12753816#action_12753816
 ] 

Pradeep Kamath commented on PIG-953:


Response to previous comment:

Dmitriy:  seekNear seems ambiguous, as "near" is a generic concept that does 
not necessarily imply "before or to, but not after" - which is what this method 
is required to do. How about "seekBefore()"?
Pradeep: I had initially thought of naming this method seekBefore(). However 
for the case where the key we are using to seek not being present in the right 
input, the loader can either position at the biggest value before the key OR 
the smallest value after the key (zebra loader for example can only do the 
latter). The name "seekBefore" suggests that implementations should always seek 
before the key in question - hence I chose seekNear.

Dmitriy: it looks like the large diff blocks in MergeSort et al are mostly 
moves of code blocks, not significant code changes, correct?
Pradeep: I am assuming you meant POMergeJoin - yes - its mostly code move to 
DefaultIndexableLoader


Dmitriy: Why does getAscColumns and getSortColumns make a copy of the list? 
Seems like we can save some memory and cpu here.
Pradeep: Findbugs complains about passing internal members as is in getters 
since the caller can then modifiy these internal members - hence the copy. This 
should not be a performance/memory issue since these copies are 1) on small 
structures 2) in front end at compile time and not at runtime.

Dmitriry: For that matter, why not use a map of (String)colName-> 
(Boolean)ascending instead of 2 lists? One structure, plus O(1) lookup.
Pradeep: This suggestion is reasonable - I picked the current implementation 
since its inline with how these things are represented today in LOSort.  Unless 
the user of SortInfo does lookup using sort column names, we won't get O(1). I 
designed this keeping zebra's use case (which is the only use case at this 
point) and the zebra store function would basically needs to know the sort keys 
in order and which of them are asc/dsc. For this they would iterate over our 
datastructure and require that the ordering of the keys match the 
primary/secondary order of the sort keys - hence a list lends itself better for 
that. I debated using a List> but thought I can avoid Pair 
since its a pig implementation class (if tomorrow zebra wants to expose the 
same SortInfo to its users).

Dmitriy: Not sure about the use of super() in the constructor of a class that 
doesn't extend anything but Object. Is there some magic that requires it?
Pradeep: This was unintended - all thanks to eclipse :) - will remove it a next 
iteration of this patch

Dmitriy: In Log2PhysTranslator, why hardcode the Limit operator? There are 
other operators that don't change sort order, such as filter. Perhaps add a 
method to Logical Operators that indicates if they alter sort order of their 
inputs?
Pradeep: Pig only guarantees order with limit following order - for any other 
relational operator following order there are no guarantees. Today it is true 
that filter or a column pruning foreach would also preserve order but this can 
change if needed in the future. There explicit code to ensure order-limit 
combination works by preserving order - there is no such explicit check for 
other operators (keeping it open for change in the future)

Dmitriy: Even with this rewrite, this seems like an odd function. It being as 
odd as it is leads to it not being used safely when you set checkEquality to 
false (just a few lines later)-- if obj1 is null and obj2 is not, the func 
returns true, you try to call a method on obj1, and get an NPE.
Pradeep: The rewrite is more terse and is another option as against the 
explicit if-else I have - more a coding style issue than correctness. The idea 
of having this function was to serve as a helper for use in Classes which need 
to implement equals().  It is cumbersone that every new class's equals has to 
do the equivalent of what you suggest:
{code}
Util.bothNull(obj1, obj2) || (Util.notNull(obj1, obj2) && obj1.equals(obj2));
{code}

This one method can be used either to only check that both objects are not null 
OR to do that and additionally check equality. Not sure I understand the oddity 
- am I missing something?

Dmitriy: This comment has a typo (and instead of "an"): 
Pradeep: Will fix in next iteration





> Enable merge join in pig to work with loaders and store functions which can 
> internally index sorted data 
> -
>
> Key: PIG-953
> URL: https://issues.apache.org/jira/browse/PIG-953
> Project: Pig
>  Issue Type: Improvement
>Affects Versions: 0.3.0
>Reporter: Pradeep Kamath
>Assignee: Pradeep Kamath
> Atta

[jira] Commented: (PIG-953) Enable merge join in pig to work with loaders and store functions which can internally index sorted data

2009-09-10 Thread Dmitriy V. Ryaboy (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-953?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12753801#action_12753801
 ] 

Dmitriy V. Ryaboy commented on PIG-953:
---

Pradeep,

First, I think this is very important to have, not just for Merge but for other 
things that might benefit from knowing sort orders, as well. 

A few minor nits from a cursory glance at the code. I didn't check the actual 
logic very carefully yet -- it looks like the large diff blocks in MergeSort et 
al are mostly moves of code blocks, not significant code changes, correct?

On to the comments:

seekNear seems ambiguous, as "near" is a generic concept that does not 
necessarily imply "before or to, but not after" -- which is what this method is 
required to do. How about "seekBefore()"?

Why does getAscColumns and getSortColumns make a copy of the list?  Seems like 
we can save some memory and cpu here.

For that matter, why not use a map of (String)colName-> (Boolean)ascending 
instead of 2 lists? One structure, plus O(1) lookup.

Not sure about the use of super() in the constructor of a class that doesn't 
extend anything but Object. Is there some magic that requires it?

In Log2PhysTranslator, why hardcode the Limit operator? There are other 
operators that don't change sort order, such as filter. Perhaps add a method to 
Logical Operators that indicates if they alter sort order of their inputs?


in Utils

checkNullEquals is better written as

{code}
if (obj1 == null || obj2 == null) {
return obj1 == obj2;
} else  {
return checkEquality ? obj1.equals(obj2) : true;
}
{code}

Even with this rewrite, this seems like an odd function. It being as odd as it 
is leads to it not being used safely when you set checkEquality to false (just 
a few lines later)-- if obj1 is null and obj2 is not, the func returns true, 
you try to call a method on obj1, and get an NPE.

Probably better not to roll all this into one amorphous function and simply 
write

{code}
Util.bothNull(obj1, obj2) || (Util.notNull(obj1, obj2) && obj1.equals(obj2));
{code}

(the implementations of bothNull and notNull are obvious -- just conjunction 
and disjunction of obj == null)

In StoreConfig
This comment has a typo (and instead of "an"): 
"* 1) the store does not follow and order by"



> Enable merge join in pig to work with loaders and store functions which can 
> internally index sorted data 
> -
>
> Key: PIG-953
> URL: https://issues.apache.org/jira/browse/PIG-953
> Project: Pig
>  Issue Type: Improvement
>Affects Versions: 0.3.0
>Reporter: Pradeep Kamath
>Assignee: Pradeep Kamath
> Attachments: PIG-953.patch
>
>
> Currently merge join implementation in pig includes construction of an index 
> on sorted data and use of that index to seek into the "right input" to 
> efficiently perform the join operation. Some loaders (notably the zebra 
> loader) internally implement an index on sorted data and can perform this 
> seek efficiently using their index. So the use of the index needs to be 
> abstracted in such a way that when the loader supports indexing, pig uses it 
> (indirectly through the loader) and does not construct an index. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.