[jira] Commented: (PIG-934) Merge join implementation currently does not seek to right point on the right side input based on the offset provided by the index

2009-09-01 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12749806#action_12749806
 ] 

Hadoop QA commented on PIG-934:
---

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12418219/pig-934_2.patch
  against trunk revision 806668.

+1 @author.  The patch does not contain any @author tags.

-1 tests included.  The patch doesn't appear to include any new or modified 
tests.
Please justify why no tests are needed for this patch.

+1 javadoc.  The javadoc tool did not generate any warning messages.

+1 javac.  The applied patch does not increase the total number of javac 
compiler warnings.

+1 findbugs.  The patch does not introduce any new Findbugs warnings.

+1 release audit.  The applied patch does not increase the total number of 
release audit warnings.

-1 core tests.  The patch failed core unit tests.

+1 contrib tests.  The patch passed contrib unit tests.

Test results: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/4/testReport/
Findbugs warnings: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/4/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/4/console

This message is automatically generated.

 Merge join implementation currently does not seek to right point on the right 
 side input based on the offset provided by the index
 --

 Key: PIG-934
 URL: https://issues.apache.org/jira/browse/PIG-934
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.3.1
Reporter: Pradeep Kamath
Assignee: Ashutosh Chauhan
 Attachments: pig-934_2.patch


 We use POLoad to seek into the right-side file, which has the following code: 
 {noformat}
public void setUp() throws IOException {
    String filename = lFile.getFileName();
    loader = (LoadFunc) PigContext.instantiateFuncFromSpec(lFile.getFuncSpec());
    is = FileLocalizer.open(filename, pc);
    loader.bindTo(filename, new BufferedPositionedInputStream(is),
                  this.offset, Long.MAX_VALUE);
}
 {noformat}
 Between opening the stream and calling bindTo we do not seek to the correct offset. 
 bindTo itself does not perform any seek.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-934) Merge join implementation currently does not seek to right point on the right side input based on the offset provided by the index

2009-09-01 Thread Ashutosh Chauhan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12749901#action_12749901
 ] 

Ashutosh Chauhan commented on PIG-934:
--

All tests passed on my local box. Not sure why they failed on hudson. 

 Merge join implementation currently does not seek to right point on the right 
 side input based on the offset provided by the index
 --

 Key: PIG-934
 URL: https://issues.apache.org/jira/browse/PIG-934
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.3.1
Reporter: Pradeep Kamath
Assignee: Ashutosh Chauhan
 Attachments: pig-934_2.patch





[jira] Commented: (PIG-934) Merge join implementation currently does not seek to right point on the right side input based on the offset provided by the index

2009-08-29 Thread Ashutosh Chauhan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12749188#action_12749188
 ] 

Ashutosh Chauhan commented on PIG-934:
--

 Seeking to an offset would only work for a single file - hence maybe have a 
 separate function...

Since open() returns an input stream, it is not hard to conceive of a use case where 
one would want to seek into that stream even when the filespec points to a 
directory or a glob. We have to define the semantics here: what does seeking in 
a directory/glob mean? One reasonable answer is to view all the files in the 
directory/glob as one big logical file, treat the offset as an offset into this 
logical file, and seek into it. Something along the lines of:
{code}
iterator = DataStreamIterator
bytesSeen = 0;
while (iterator.hasNext()) {
  open current file pointed to by iterator
  bytesSeen += current file length
  if (bytesSeen > offset)
    bind to adjusted offset in current file and return
  else
    continue;
}
{code}
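
The pseudocode above can be sketched in plain Java as follows. This is only an illustration under stated assumptions: LogicalSeek, Position and locate are hypothetical names that do not exist in Pig, and each file is represented only by its length, in whatever order the iterator would visit the files.

```java
import java.util.List;

// Hypothetical helper, not part of Pig: maps an offset in the "logical file"
// formed by concatenating several files onto (file index, offset in that file).
final class LogicalSeek {

    // Result of a lookup: which file holds the offset, and where inside it.
    static final class Position {
        final int fileIndex;
        final long offsetInFile;

        Position(int fileIndex, long offsetInFile) {
            this.fileIndex = fileIndex;
            this.offsetInFile = offsetInFile;
        }
    }

    // fileLengths lists the file sizes in iteration order (an assumption:
    // the ordering must be consistent between index creation and lookup).
    static Position locate(List<Long> fileLengths, long offset) {
        if (offset < 0) {
            throw new IllegalArgumentException("negative offset: " + offset);
        }
        long bytesSeen = 0; // total length of the files already passed
        for (int i = 0; i < fileLengths.size(); i++) {
            long length = fileLengths.get(i);
            if (offset < bytesSeen + length) {
                // The offset falls inside file i; adjust it to be file-relative.
                return new Position(i, offset - bytesSeen);
            }
            bytesSeen += length;
        }
        throw new IllegalArgumentException(
                "offset " + offset + " is past the end of the input (" + bytesSeen + " bytes)");
    }
}
```

With file lengths 10, 20 and 5, a logical offset of 12 resolves to file index 1 at offset 2. The sketch only does the arithmetic; opening the chosen file and binding the loader to the adjusted offset is left out.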

But since there is currently no requirement for this, we can detect the 
case where a seek is requested on a directory/glob and throw an exception (as 
is done in this patch). Later on, if we decide to support it instead of 
throwing an exception, we can implement whatever semantics we decide on. If we 
create a new function with a separate name, making these changes later will be 
confusing. Moreover, if there is a different function, users of the API 
need to know about it and deal with it (e.g., a special constructor in 
POLoad). I think the presence or absence of an offset parameter in the argument 
list is a sufficient indicator of which overloaded version of open() to call 
when a seek is needed. 
Thoughts?
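
To make the overloading idea concrete, here is a minimal sketch using only the JDK. It is not the patch and not Pig's FileLocalizer: OpenSketch and both open() overloads are hypothetical, and local java.nio paths stand in for Pig file specs. The version with an offset rejects anything that is not a single regular file, and the version without an offset is left as a plain open.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.channels.Channels;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical stand-in for an open() API with an optional seek offset.
final class OpenSketch {

    // No offset: a plain open, with no seeking involved.
    static InputStream open(Path path) throws IOException {
        return Files.newInputStream(path);
    }

    // With an offset: only defined for a single regular file in this sketch.
    static InputStream open(Path path, long offset) throws IOException {
        if (!Files.isRegularFile(path)) {
            throw new IOException("seek requested on something that is not a single file: " + path);
        }
        FileChannel channel = FileChannel.open(path, StandardOpenOption.READ);
        try {
            if (offset < 0 || offset > channel.size()) {
                throw new IOException("offset " + offset + " is outside " + path);
            }
            channel.position(offset); // the seek happens here, before any read
        } catch (IOException | RuntimeException e) {
            channel.close(); // do not leak the channel if the seek fails
            throw e;
        }
        // Reads start at the channel's current position; closing the stream
        // also closes the channel.
        return Channels.newInputStream(channel);
    }
}
```

A caller that needs a seek simply passes the offset, and callers that do not are unaffected. If seeking into directories were supported later, only the body of the second overload would have to change.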

 Merge join implementation currently does not seek to right point on the right 
 side input based on the offset provided by the index
 --

 Key: PIG-934
 URL: https://issues.apache.org/jira/browse/PIG-934
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.3.1
Reporter: Pradeep Kamath
Assignee: Ashutosh Chauhan
 Attachments: pig-934.patch





[jira] Commented: (PIG-934) Merge join implementation currently does not seek to right point on the right side input based on the offset provided by the index

2009-08-29 Thread Dmitriy V. Ryaboy (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12749196#action_12749196
 ] 

Dmitriy V. Ryaboy commented on PIG-934:
---

Throwing an exception when a seek is past the file boundary seems acceptable to 
me (and preferable to adding new functions and changing upstream code that 
shouldn't care about this detail). Especially since if there is a way to get a 
consistent ordering among files in a directory, it's trivial to later update 
this code to seek past file boundaries and into the next file.

 Merge join implementation currently does not seek to right point on the right 
 side input based on the offset provided by the index
 --

 Key: PIG-934
 URL: https://issues.apache.org/jira/browse/PIG-934
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.3.1
Reporter: Pradeep Kamath
Assignee: Ashutosh Chauhan
 Attachments: pig-934.patch





[jira] Commented: (PIG-934) Merge join implementation currently does not seek to right point on the right side input based on the offset provided by the index

2009-08-28 Thread Pradeep Kamath (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12749042#action_12749042
 ] 

Pradeep Kamath commented on PIG-934:


The reason I thought a separate function with a singleFile in the name was 
needed was because the current FileLocalizer.open() can handle directories and 
hence returns a DataStorageInputStreamIterator which internally iterates over 
the underlying multiple streams of the files in the directory. Keeping the same 
name may give the impression that the same capability is present even for the 
version which seeks to an offset. Seeking to an offset would only work for a 
single file - hence maybe have a separate function where the name implies this 
restriction might be cleaner.

 Merge join implementation currently does not seek to right point on the right 
 side input based on the offset provided by the index
 --

 Key: PIG-934
 URL: https://issues.apache.org/jira/browse/PIG-934
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.3.1
Reporter: Pradeep Kamath
Assignee: Ashutosh Chauhan
 Attachments: pig-934.patch





[jira] Commented: (PIG-934) Merge join implementation currently does not seek to right point on the right side input based on the offset provided by the index

2009-08-26 Thread Pradeep Kamath (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12748070#action_12748070
 ] 

Pradeep Kamath commented on PIG-934:


To get an idea of how this seeking works for regular loads in map tasks, I 
looked at PigSlice.java; the seek happens in the init() code before bindTo():
{code}
public void init(DataStorage base) throws IOException {
    ..
    fsis = base.asElement(base.getActiveContainer(), file).sopen();
    fsis.seek(start, FLAGS.SEEK_CUR);

    end = start + getLength();

    if (file.endsWith(".bz") || file.endsWith(".bz2")) {
        is = new CBZip2InputStream(fsis, 9);
    } else if (file.endsWith(".gz")) {
        is = new GZIPInputStream(fsis);
        // We can't tell how much of the underlying stream GZIPInputStream
        // has actually consumed
        end = Long.MAX_VALUE;
    } else {
        is = fsis;
    }
    loader.bindTo(file.toString(), new BufferedPositionedInputStream(is,
            start), start, end);
}
{code}
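
Why the seek has to come before bindTo() can be shown with a small self-contained example. This is a sketch under stated assumptions, not Pig code: SeekBeforeBind, readLine and skipFully are hypothetical names, and readLine stands in for a loader that trusts the stream to be positioned already, which is how bindTo is described in this issue.

```java
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical demo, not part of Pig.
final class SeekBeforeBind {

    // Stand-in for a loader: it assumes the stream is already at the right
    // position and simply returns the next line (without the newline).
    static String readLine(InputStream in) throws IOException {
        StringBuilder sb = new StringBuilder();
        int c;
        while ((c = in.read()) != -1 && c != '\n') {
            sb.append((char) c);
        }
        return sb.toString();
    }

    // Advances the stream by exactly n bytes. InputStream.skip may skip fewer
    // bytes than requested, so loop until done or until end of stream.
    static void skipFully(InputStream in, long n) throws IOException {
        long remaining = n;
        while (remaining > 0) {
            long skipped = in.skip(remaining);
            if (skipped <= 0) {
                // skip made no progress: fall back to reading a single byte
                if (in.read() == -1) {
                    throw new EOFException("stream ended " + remaining + " bytes before the offset");
                }
                skipped = 1;
            }
            remaining -= skipped;
        }
    }
}
```

For the input "a,1\nb,2\nc,3\n" and an offset of 4, calling readLine on a freshly opened stream returns "a,1", whereas calling skipFully(in, 4) first makes readLine return "b,2". Passing the offset to the reader without moving the stream gives the first behaviour, which is the bug reported here.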

I think we need a FileLocalizer.sOpenSingleFile() method that returns a 
SeekableInputStream, which we can then use in setUp() in POLoad.
Something along the lines of:
{code}
static public InputStream open(String fileSpec, PigContext pigContext) throws IOException {
    fileSpec = checkDefaultPrefix(pigContext.getExecType(), fileSpec);
    if (!fileSpec.startsWith(LOCAL_PREFIX)) {
        init(pigContext);
        ElementDescriptor elem = pigContext.getDfs().asElement(fullPath(fileSpec, pigContext));
        return elem.sopen();
    }
    else {
        fileSpec = fileSpec.substring(LOCAL_PREFIX.length());
        //buffering because we only want buffered streams to be passed to load functions.
        /*return new BufferedInputStream(new FileInputStream(fileSpec));*/
        init(pigContext);
        ElementDescriptor elem = pigContext.getLfs().asElement(fullPath(fileSpec, pigContext));
        return elem.sopen();
    }
}
{code}
The above code would only work with single files and not directories, which 
should be OK for merge join. We should probably set this up with a new 
constructor in POLoad that also indicates that a single file is being processed.



 Merge join implementation currently does not seek to right point on the right 
 side input based on the offset provided by the index
 --

 Key: PIG-934
 URL: https://issues.apache.org/jira/browse/PIG-934
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.3.1
Reporter: Pradeep Kamath
