Re: Question about FileInputFormat splits

Edward J. Yoon Mon, 20 Oct 2014 15:01:26 -0700

Hi, so it works as you expected? I thought bsp.input.runtime.partitioning
should be true. :0


--
Best Regards, Edward J. Yoon
Chief Executive Officer
DataSayer Co., Ltd.

> On Oct 21, 2014, at 6:31 AM, Leonidas Fegaras <[email protected]> wrote:
> 
> Hi Edward,
> OK. It works now. I used the following in hama-site.xml:
> 
>  <property>
>    <name>bsp.input.runtime.partitioning</name>
>    <value>false</value>
>  </property>
> 
> and restarted bspd. The correct code for the job is:
> 
> job.setNumBspTask(10);
> job.setPartitioner(org.apache.hama.bsp.HashPartitioner.class);
> 
> Maybe you should explain this in the Hama Wiki.
> Thanks.
> Leonidas
> 
> On 10/20/2014 02:19 PM, Leonidas Fegaras wrote:
>> Hi Edward,
>> Thank you for the reply.
>> But I want the opposite: I want to create more tasks than blocks, not
>> fewer tasks than blocks.
>> That is, I want to be able to send less than one block to each task (for
>> example, only 10000 bytes). Sending less data to a task will speed up
>> execution and will require less memory at each node. Hadoop MapReduce,
>> Spark, and Flink allow you to use a split size smaller than a block.
>> Also, I used to be able to do this with Hama 0.5.0 but not with Hama
>> 0.6.4. Did you remove this capability because it is a bad idea or
>> because it is very hard to implement?
>> 
>> Based on your instructions, I tried the following:
>> 
>>      job.setNumBspTask(10);
>>      job.setBoolean("bsp.input.runtime.partitioning",false);
>>      job.setPartitioner(org.apache.hama.bsp.HashPartitioner.class);
>> 
>> I get the following error:
>> 
>> java.lang.ArrayIndexOutOfBoundsException: 1
>>      at org.apache.hama.bsp.BSPJobClient.writeSplits(BSPJobClient.java:556)
>>      at org.apache.hama.bsp.BSPJobClient.submitJobInternal(BSPJobClient.java:354)
>>      at org.apache.hama.bsp.BSPJobClient.submitJob(BSPJobClient.java:296)
>>      at org.apache.hama.bsp.BSPJob.submit(BSPJob.java:219)
>>      at org.apache.hama.bsp.BSPJob.waitForCompletion(BSPJob.java:226)
>> 
>> Thanks.
>> Leonidas
>> 
>> 
>> On 10/20/2014 10:06 AM, Edward J. Yoon wrote:
>>> Hi Leonidas,
>>> 
>>> The bsp.min.split.size property is used to prevent the creation of too
>>> many tasks, as in Hadoop MR (NOTE: if bsp.min.split.size is less than the
>>> block size, then one block is sent to each task).
>>> 
>>> I guess this will work fine. BTW, if you set the input partitioner,
>>> then the input partitioner creates new partitions, as many as you
>>> specified in the setNumBspTask() method (graph jobs pre-process the
>>> input with (hash) partitioning by default).
>>> 
>>> Thanks.
>>> 
>>> --
>>> Best Regards, Edward J. Yoon
>>> Chief Executive Officer
>>> DataSayer Co., Ltd.
>>> 
>>>> On Oct 20, 2014, at 10:51 PM, Leonidas Fegaras <[email protected]
>>>> <mailto:[email protected]>> wrote:
>>>> 
>>>> Dear Hama developers,
>>>> I still have a problem setting the split size of an HDFS input file
>>>> using Hama 0.6.4.  For example, when I use:
>>>> 
>>>> BSPJob job = new BSPJob(conf,BSPop.class);
>>>> job.setNumBspTask(10);
>>>> job.setLong("bsp.min.split.size",10000L);   // 10000 bytes
>>>> 
>>>> For a small file with 2 blocks, this will use only 2 BSP tasks (one
>>>> for each block), instead of 10.
>>>> This used to work in Hama 0.5.0.
>>>> Any suggestions?
>>>> Thanks.
>>>> Leonidas Fegaras
>>>> 
>>>> On 01/04/2013 05:45 PM, Edward J. Yoon wrote:
>>>>> Hello,
>>>>> 
>>>>>> than a block. But if you have more nodes in your cluster than data blocks,
>>>>>> you may get faster execution if you allow splits smaller than a block. Is
>>>>> You're right. So, we're working on partitioning issues now.
>>>>> 
>>>>>> you may get faster execution if you allow splits smaller than a block. Is
>>>>>> there any way to use splits smaller than a block in Hama 0.6.0?
>>>>> Yes, but support will come in Hama 0.6.1.
>>>>> 
>>>>> On Sat, Jan 5, 2013 at 4:59 AM, Leonidas Fegaras
>>>>> <[email protected] <mailto:[email protected]>> wrote:
>>>>>> Dear Hama developers,
>>>>>> It seems that the splits generated by the FileInputFormat in Hama 0.6.0
>>>>>> cannot be smaller than a block. In Hama 0.5.0, I could set any split size
>>>>>> using job.set("bsp.min.split.size",...) and set the number of tasks using
>>>>>> job.setNumBspTask(...). This is ignored by Hama 0.6.0 for a split smaller
>>>>>> than a block. But if you have more nodes in your cluster than data blocks,
>>>>>> you may get faster execution if you allow splits smaller than a block. Is
>>>>>> there any way to use splits smaller than a block in Hama 0.6.0?
>>>>>> Thanks for your help,
>>>>>> Leonidas
>>>>>> 
>>>>> 
>
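
For readers following the thread, the behaviour discussed above can be
illustrated with a small, self-contained Java sketch. It does not use or
reproduce Hama's code: the class SplitSketch and its methods
blockBoundSplitSize, numSplits and partitionOf are hypothetical names, and
the 64 MB block size is an assumption. The first part models Edward's note
that a bsp.min.split.size below the block size still yields one block per
task. The second part shows how hash-style partitioning can spread records
over the number of tasks requested with setNumBspTask(), independent of the
number of blocks.

```java
import java.util.Arrays;

public class SplitSketch {
    // Toy model of the behaviour described in the thread (NOT Hama's code):
    // a split is never smaller than one block.
    static long blockBoundSplitSize(long minSplitSize, long blockSize) {
        return Math.max(minSplitSize, blockSize);
    }

    // Number of splits needed to cover the file (ceiling division).
    static long numSplits(long fileSize, long splitSize) {
        return (fileSize + splitSize - 1) / splitSize;
    }

    // Hash-style assignment of a record key to one of numTasks tasks.
    // floorMod keeps the result non-negative for negative hash codes.
    static int partitionOf(Object key, int numTasks) {
        return Math.floorMod(key.hashCode(), numTasks);
    }

    public static void main(String[] args) {
        long blockSize = 64L * 1024 * 1024; // assumed block size
        long fileSize = 2 * blockSize;      // the "small file with 2 blocks"

        // A 10000-byte minimum does not shrink the split below one block.
        long splitSize = blockBoundSplitSize(10000L, blockSize);
        System.out.println("splits = " + numSplits(fileSize, splitSize));
        // prints: splits = 2

        // With a partitioner, records are spread over all 10 tasks instead.
        int[] counts = new int[10];
        for (int i = 0; i < 1000; i++) {
            counts[partitionOf(Integer.valueOf(i), 10)]++;
        }
        System.out.println(Arrays.toString(counts));
        // prints: [100, 100, 100, 100, 100, 100, 100, 100, 100, 100]
    }
}
```

In this model the task count is decoupled from the block count only in the
second part, which matches the outcome reported in the thread: ten tasks
were obtained by setting a partitioner, not by lowering the split size.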
