Re: Question about FileInputFormat splits

Leonidas Fegaras Wed, 22 Oct 2014 10:06:12 -0700

Hi Edward,
I am testing my programs with:
job.setPartitioner(org.apache.hama.bsp.HashPartitioner.class);

The splitter works fine for Hadoop sequence files, but it gets errors for
text files. From the messages below, it seems that the splitter didn't
produce a split-00001 file. Then the BSPJobClient.readSplitFile method
gets 4 splits, but the split IDs are 0, 2, 3, and 4. Is this a Hama bug,
or is my InputFormat wrong? (It works fine without setPartitioner.)

Thanks.
Leonidas


14/10/22 09:17:59 INFO bsp.FileInputFormat: Total input paths to process : 1
14/10/22 09:17:59 INFO bsp.FileInputFormat: Total input paths to process : 1
14/10/22 09:17:59 INFO bsp.FileInputFormat: Total input paths to process : 1
14/10/22 09:18:00 INFO bsp.BSPJobClient: Running job: job_201410220850_0006
14/10/22 09:18:03 INFO bsp.BSPJobClient: Current supersteps number: 0
14/10/22 09:18:09 INFO bsp.BSPJobClient: Current supersteps number: 2
14/10/22 09:18:12 INFO bsp.BSPJobClient: The total number of supersteps: 2
14/10/22 09:18:12 INFO bsp.BSPJobClient: Counters: 6
14/10/22 09:18:12 INFO bsp.BSPJobClient:   org.apache.hama.bsp.JobInProgress$JobCounter
14/10/22 09:18:12 INFO bsp.BSPJobClient:     SUPERSTEPS=2
14/10/22 09:18:12 INFO bsp.BSPJobClient:     LAUNCHED_TASKS=1
14/10/22 09:18:12 INFO bsp.BSPJobClient:   org.apache.hama.bsp.BSPPeerImpl$PeerCounter
14/10/22 09:18:12 INFO bsp.BSPJobClient:     SUPERSTEP_SUM=2
14/10/22 09:18:12 INFO bsp.BSPJobClient:     TIME_IN_SYNC_MS=117
14/10/22 09:18:12 INFO bsp.BSPJobClient:     IO_BYTES_READ=511839
14/10/22 09:18:12 INFO bsp.BSPJobClient:     TASK_INPUT_RECORDS=12373
14/10/22 09:18:12 INFO bsp.FileInputFormat: Total input paths to process : 4
java.io.IOException: java.lang.ArrayIndexOutOfBoundsException: 4
    at org.apache.hama.bsp.BSPJobClient.readSplitFile(BSPJobClient.java:611)
    at org.apache.hama.bsp.JobInProgress.initTasks(JobInProgress.java:261)
    at org.apache.hama.bsp.QueueManager.initJob(QueueManager.java:44)
    at org.apache.hama.bsp.SimpleTaskScheduler$JobListener.jobAdded(SimpleTaskScheduler.java:117)
    at org.apache.hama.bsp.BSPMaster.addJob(BSPMaster.java:753)
    at org.apache.hama.bsp.BSPMaster.submitJob(BSPMaster.java:614)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:601)
    at org.apache.hama.ipc.RPC$Server.call(RPC.java:613)
    at org.apache.hama.ipc.Server$Handler$1.run(Server.java:1211)
    at org.apache.hama.ipc.Server$Handler$1.run(Server.java:1207)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1491)
    at org.apache.hama.ipc.Server$Handler.run(Server.java:1206)

> hadoop fs -ls /tmp/hama-parts/job_201410220850_0005
Found 4 items
-rw-r--r--   3 hadoop supergroup     240516 2014-10-22 09:18 /tmp/hama-parts/job_201410220850_0005/part-00000
-rw-r--r--   3 hadoop supergroup     242699 2014-10-22 09:18 /tmp/hama-parts/job_201410220850_0005/part-00002
-rw-r--r--   3 hadoop supergroup       5710 2014-10-22 09:18 /tmp/hama-parts/job_201410220850_0005/part-00003
-rw-r--r--   3 hadoop supergroup     247892 2014-10-22 09:18 /tmp/hama-parts/job_201410220850_0005/part-00004
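
The symptom reported here (four part files whose IDs are 0, 2, 3, and 4,
followed by ArrayIndexOutOfBoundsException: 4) is consistent with an array
sized by the number of files being indexed by partition ID. The following
is only a minimal sketch of that failure mode under that assumption. It is
not Hama's readSplitFile code, and the class and method names are
hypothetical.

```java
import java.util.Arrays;
import java.util.List;

class SplitGapDemo {
    // Hypothetical helper: stores each partition under its ID in an array
    // sized by the number of files found, as if IDs were always 0..n-1.
    static String[] indexById(List<Integer> partIds) {
        String[] slots = new String[partIds.size()];
        for (int id : partIds) {
            // Throws ArrayIndexOutOfBoundsException when id >= slots.length.
            slots[id] = String.format("part-%05d", id);
        }
        return slots;
    }

    public static void main(String[] args) {
        // IDs taken from the directory listing: part-00001 is missing.
        List<Integer> ids = Arrays.asList(0, 2, 3, 4);
        try {
            indexById(ids);
            System.out.println("all partitions indexed");
        } catch (ArrayIndexOutOfBoundsException e) {
            System.out.println("out of bounds: " + e.getMessage());
        }
    }
}
```

With contiguous IDs (0, 1, 2, 3) the same helper completes normally, so
the gap alone is enough to trigger the exception in this sketch.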






On 10/20/2014 04:59 PM, Edward J. Yoon wrote:

Hi, does it work as you expected? I thought bsp.input.runtime.partitioning should be true. :0


--
Best Regards, Edward J. Yoon
Chief Executive Officer
DataSayer Co., Ltd.

On Oct 21, 2014, at 6:31 AM, Leonidas Fegaras <[email protected]> wrote:


Hi Edward,
OK. It works now. I used the following in hama-site.xml:

 <property>
   <name>bsp.input.runtime.partitioning</name>
   <value>false</value>
 </property>

and re-started bspd. The correct code for the Job is:

job.setNumBspTask(10);
job.setPartitioner(org.apache.hama.bsp.HashPartitioner.class);

Maybe you should explain this in the Hama Wiki.
Thanks.
Leonidas

On 10/20/2014 02:19 PM, Leonidas Fegaras wrote:

Hi Edward,
Thank you for the reply.
But I want the opposite: I want to create more tasks than blocks, not
fewer tasks than blocks.
That is, I want to be able to send less than one block to each task (for
example, only 10000 bytes). Sending less data to a task will speed up
execution and will require less memory at each node. Hadoop map-reduce,
Spark, and Flink allow you to use a split size smaller than a block.
Also, I used to be able to do this with Hama 0.5.0 but not with Hama
0.6.4. Did you remove this capability because it is a bad idea or
because it is very hard to implement?

Based on your instructions, I tried the following:

     job.setNumBspTask(10);
     job.setBoolean("bsp.input.runtime.partitioning", false);
     job.setPartitioner(org.apache.hama.bsp.HashPartitioner.class);

I get the following error:

java.lang.ArrayIndexOutOfBoundsException: 1
     at org.apache.hama.bsp.BSPJobClient.writeSplits(BSPJobClient.java:556)
     at org.apache.hama.bsp.BSPJobClient.submitJobInternal(BSPJobClient.java:354)
     at org.apache.hama.bsp.BSPJobClient.submitJob(BSPJobClient.java:296)
     at org.apache.hama.bsp.BSPJob.submit(BSPJob.java:219)
     at org.apache.hama.bsp.BSPJob.waitForCompletion(BSPJob.java:226)

Thanks.
Leonidas


On 10/20/2014 10:06 AM, Edward J. Yoon wrote:

Hi Leonidas,

The bsp.min.split.size property is used to avoid creating too many
tasks, as in Hadoop MR (NOTE: if bsp.min.split.size is less than the
block size, then one block is sent to each task).
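
For comparison, the split-size arithmetic of Hadoop's old-API (mapred)
FileInputFormat can be sketched as below. The formula
max(minSize, min(goalSize, blockSize)) and the 1.1 slack factor are
Hadoop's. Whether Hama 0.6.4 applies the same formula is not stated in
this thread, and the 64 MB block size and the class name are assumptions
made for illustration.

```java
class SplitSizeSketch {
    // Split-size formula of Hadoop's old-API (mapred) FileInputFormat.
    static long computeSplitSize(long goalSize, long minSize, long blockSize) {
        return Math.max(minSize, Math.min(goalSize, blockSize));
    }

    // Counts splits for one file; a final remainder of up to 1.1 times
    // the split size is kept as a single split (Hadoop's SPLIT_SLOP).
    static int countSplits(long fileSize, long splitSize) {
        long remaining = fileSize;
        int splits = 0;
        while ((double) remaining / splitSize > 1.1) {
            remaining -= splitSize;
            splits++;
        }
        if (remaining != 0) {
            splits++;
        }
        return splits;
    }

    public static void main(String[] args) {
        long blockSize = 64L * 1024 * 1024; // assumed block size
        long fileSize = 2 * blockSize;      // a file with 2 blocks
        long minSize = 10000L;              // bsp.min.split.size from the thread
        int requestedTasks = 10;            // value passed to setNumBspTask

        long goalSize = fileSize / requestedTasks;
        long splitSize = computeSplitSize(goalSize, minSize, blockSize);
        System.out.println("split size = " + splitSize
                + ", splits = " + countSplits(fileSize, splitSize));
    }
}
```

Under these assumptions the formula yields ten splits for the two-block
file. When the goal size is at least one block, the split size is capped
at the block size and the same file yields two splits, which matches the
one-block-per-task behavior described in the note above.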

I guess this will work fine. BTW, if you set the input partitioner,
it creates new partitions, as many as you specified in the
setNumBspTask() method (a graph job pre-processes its input with the
hash partitioner by default).
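
As a rough illustration of what partitioning the input into
setNumBspTask() buckets by hash means, here is a minimal sketch. It is
not the source of org.apache.hama.bsp.HashPartitioner; the class name,
method names, and sample keys are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class HashPartitionSketch {
    // Maps a key to one of numTasks partitions; the mask keeps the
    // hash non-negative so the result is always in [0, numTasks).
    static int partitionOf(Object key, int numTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numTasks;
    }

    // Groups keys into numTasks buckets, one bucket per BSP task.
    static List<List<String>> partition(List<String> keys, int numTasks) {
        List<List<String>> buckets = new ArrayList<List<String>>();
        for (int i = 0; i < numTasks; i++) {
            buckets.add(new ArrayList<String>());
        }
        for (String key : keys) {
            buckets.get(partitionOf(key, numTasks)).add(key);
        }
        return buckets;
    }

    public static void main(String[] args) {
        List<String> keys = Arrays.asList("a", "b", "c", "d", "e", "f");
        List<List<String>> buckets = partition(keys, 10);
        for (int i = 0; i < buckets.size(); i++) {
            System.out.println("partition " + i + ": " + buckets.get(i));
        }
    }
}
```

Note that with few distinct keys some buckets stay empty, so code that
consumes the partitions should not assume every partition ID has data.
This is a general property of hash partitioning, not a diagnosis of the
error discussed in this thread.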

Thanks.

--
Best Regards, Edward J. Yoon
Chief Executive Officer
DataSayer Co., Ltd.

On Oct 20, 2014, at 10:51 PM, Leonidas Fegaras <[email protected]> wrote:

Dear Hama developers,
I still have a problem setting the split size of an HDFS input file
using Hama 0.6.4.  For example, when I use:

BSPJob job = new BSPJob(conf,BSPop.class);
job.setNumBspTask(10);
job.setLong("bsp.min.split.size",10000L);   // 10000 bytes

For a small file with 2 blocks, this will use only 2 BSP tasks (one
for each block), instead of 10.
This used to work in Hama 0.5.0.
Any suggestions?
Thanks.
Leonidas Fegaras

On 01/04/2013 05:45 PM, Edward J. Yoon wrote:

Hello,

> than a block. But if you have more nodes in your cluster than data
> blocks, you may get faster execution if you allow splits smaller than
> a block. Is

You're right. So, we're working on partitioning issues now.

> you may get faster execution if you allow splits smaller than a
> block. Is there any way to use splits smaller than a block in Hama
> 0.6.0?

Yes. But, Hama 0.6.1 version will support it.

On Sat, Jan 5, 2013 at 4:59 AM, Leonidas Fegaras <[email protected]> wrote:

Dear Hama developers,

It seems that the splits generated by the FileInputFormat in Hama 0.6.0
cannot be smaller than a block. In Hama 0.5.0, I could set any split
size using job.set("bsp.min.split.size",...) and set the task numbers
using job.setNumBspTask(...). This is ignored by Hama 0.6.0 for a split
smaller than a block. But if you have more nodes in your cluster than
data blocks, you may get faster execution if you allow splits smaller
than a block. Is there any way to use splits smaller than a block in
Hama 0.6.0?
Thanks for your help,
Leonidas
