[jira] [Commented] (HAMA-757) The partitioning job output should be un-splitable

2013-05-15 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HAMA-757?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13659226#comment-13659226
 ] 

Hudson commented on HAMA-757:
-

Integrated in Hama-Nightly #911 (See 
[https://builds.apache.org/job/Hama-Nightly/911/])
HAMA-757: The partitioning job output should be un-splitable (MaoYuan Xian 
via edwardyoon) (Revision 1482677)

 Result = SUCCESS
edwardyoon : 
Files : 
* /hama/trunk/CHANGES.txt
* /hama/trunk/core/src/main/java/org/apache/hama/bsp/BSPJobClient.java
* /hama/trunk/core/src/main/java/org/apache/hama/bsp/NonSplitSequenceFileInputFormat.java


> The partitioning job output should be un-splitable
> --
>
> Key: HAMA-757
> URL: https://issues.apache.org/jira/browse/HAMA-757
> Project: Hama
> Issue Type: Bug
> Components: bsp core
> Affects Versions: 0.6.1
> Reporter: MaoYuan Xian
> Assignee: MaoYuan Xian
> Fix For: 0.6.2
>
> Attachments: HAMA-757.patch
>
>
> When the output sequence files from the partitioning job are large (bigger 
> than two HDFS block sizes), the second round of the job (which uses these 
> sequence files as input) will start more tasks than the client wants. 
> Sometimes this uncertainty makes the job exceed the cluster slot capacity.
> In a real project, I implemented a new InputFormat that is marked as 
> un-splitable to solve the problem. Is there any better way?
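
To make the task-count effect concrete, here is a minimal plain-Java model of how a Hadoop-style file input format counts splits. It is a sketch, not Hama's code: the class and method names are hypothetical, and it assumes the split size equals the block size and that the 10% "slop" rule used by Hadoop's FileInputFormat applies.

```java
public class SplitCountSketch {
    // Assumption: same tolerance as Hadoop's FileInputFormat, where the last
    // split may be up to 10% larger than the nominal split size.
    static final double SPLIT_SLOP = 1.1;

    /**
     * Counts the splits (and therefore tasks) produced for one non-empty file.
     * A non-splitable format always yields exactly one split per file.
     */
    static int countSplits(long fileLen, long splitSize, boolean splitable) {
        if (fileLen <= 0 || splitSize <= 0) {
            throw new IllegalArgumentException("lengths must be positive");
        }
        if (!splitable) {
            return 1;
        }
        long remaining = fileLen;
        int splits = 0;
        // Carve off full-size splits while more than 1.1 split sizes remain.
        while ((double) remaining / splitSize > SPLIT_SLOP) {
            splits++;
            remaining -= splitSize;
        }
        // Whatever is left becomes the final split.
        if (remaining != 0) {
            splits++;
        }
        return splits;
    }

    public static void main(String[] args) {
        long blockSize = 64L * 1024 * 1024;   // 64 MB, a common HDFS default
        long partitionLen = 160L * 1024 * 1024; // 2.5 blocks per partition file
        int partitions = 4;                   // tasks the client asked for

        int splitableTasks = partitions * countSplits(partitionLen, blockSize, true);
        int nonSplitTasks = partitions * countSplits(partitionLen, blockSize, false);

        System.out.println("splitable input:     " + splitableTasks + " tasks");
        System.out.println("non-splitable input: " + nonSplitTasks + " tasks");
    }
}
```

With four partition files of 2.5 blocks each, the splitable case asks for 12 tasks instead of the 4 the client requested, which is the overshoot described above.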

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HAMA-757) The partitioning job output should be un-splitable

2013-05-14 Thread Edward J. Yoon (JIRA)

[ 
https://issues.apache.org/jira/browse/HAMA-757?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13657093#comment-13657093
 ] 

Edward J. Yoon commented on HAMA-757:
-

I'm +1




[jira] [Commented] (HAMA-757) The partitioning job output should be un-splitable

2013-05-14 Thread Edward J. Yoon (JIRA)

[ 
https://issues.apache.org/jira/browse/HAMA-757?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13657063#comment-13657063
 ] 

Edward J. Yoon commented on HAMA-757:
-

Hmm, it's somewhat hard to decide. Any other opinions?




[jira] [Commented] (HAMA-757) The partitioning job output should be un-splitable

2013-05-14 Thread MaoYuan Xian (JIRA)

[ 
https://issues.apache.org/jira/browse/HAMA-757?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13656983#comment-13656983
 ] 

MaoYuan Xian commented on HAMA-757:
---

Uploaded the patch.




[jira] [Commented] (HAMA-757) The partitioning job output should be un-splitable

2013-05-13 Thread Edward J. Yoon (JIRA)

[ 
https://issues.apache.org/jira/browse/HAMA-757?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13655844#comment-13655844
 ] 

Edward J. Yoon commented on HAMA-757:
-

We might be able to approximately calculate the size of the partition file 
before merging. 

{code}
// Excerpt from the partition merge step; the loop body is elided here.
SequenceFile.Writer writer = SequenceFile.createWriter(fs, conf,
    partitionFile, keyClass, valueClass, CompressionType.NONE);

for (int i = 0; i < files.length; i++) {
  LOG.debug("merge '" + files[i].getPath() + "' into " + partitionDir
      + "/" + getPartitionName(partitionID));
  // ... (remainder of the loop not shown in the original comment)
}
{code}

But I don't fully understand your solution yet. Please feel free to upload 
your patch. Let's think about it more and discuss.
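
As a rough illustration of that idea, the sketch below sums the lengths of the files to be merged and derives a per-file block size large enough to hold the result in one block. It is plain Java with hypothetical names; in a real job the lengths would presumably come from `FileStatus.getLen()` and the block size would be passed to a `FileSystem.create` overload that accepts one. The 10% headroom is an assumed safety margin, and the rounding reflects the HDFS rule that a block size must be a multiple of the checksum chunk size (512 bytes by default).

```java
public class PartitionBlockSizeSketch {
    // HDFS rejects block sizes that are not a multiple of io.bytes.per.checksum
    // (512 bytes by default).
    static final long CHECKSUM_CHUNK = 512L;

    /** Approximates the merged partition size as the sum of its input files. */
    static long estimateMergedSize(long[] partFileLengths) {
        long total = 0;
        for (long len : partFileLengths) {
            total += len;
        }
        return total;
    }

    /**
     * Picks a block size that keeps the merged file in a single block.
     * Adds 10% headroom (an assumption, since the estimate is approximate),
     * never goes below the default block size, and rounds up to a multiple
     * of the checksum chunk size.
     */
    static long blockSizeFor(long estimatedBytes, long defaultBlockSize) {
        long padded = estimatedBytes + estimatedBytes / 10;
        long needed = Math.max(padded, defaultBlockSize);
        long rem = needed % CHECKSUM_CHUNK;
        return rem == 0 ? needed : needed + (CHECKSUM_CHUNK - rem);
    }

    public static void main(String[] args) {
        long[] lengths = {
            40L * 1024 * 1024, 50L * 1024 * 1024, 70L * 1024 * 1024
        };
        long estimate = estimateMergedSize(lengths);
        long blockSize = blockSizeFor(estimate, 64L * 1024 * 1024);
        System.out.println("estimated merged size: " + estimate + " bytes");
        System.out.println("suggested block size:  " + blockSize + " bytes");
    }
}
```

The trade-off is that this depends on the estimate being good enough, whereas a non-splitable input format gives one task per partition file regardless of file size.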




[jira] [Commented] (HAMA-757) The partitioning job output should be un-splitable

2013-05-13 Thread MaoYuan Xian (JIRA)

[ 
https://issues.apache.org/jira/browse/HAMA-757?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13655839#comment-13655839
 ] 

MaoYuan Xian commented on HAMA-757:
---

Yes. DFSClient creates files using the "dfs.block.size" value as the block 
size. But this approach requires the customer or the Hama job client to know 
each partition's size well and to set the correct value when creating the 
file output stream.




[jira] [Commented] (HAMA-757) The partitioning job output should be un-splitable

2013-05-13 Thread Edward J. Yoon (JIRA)

[ 
https://issues.apache.org/jira/browse/HAMA-757?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13655827#comment-13655827
 ] 

Edward J. Yoon commented on HAMA-757:
-

{quote} Sometimes this uncertainty makes the job exceed the cluster slot 
capacity. {quote}

Thanks for sharing your experience! If I remember correctly, the block 
size is configurable per file (when creating a file on HDFS).

