Thanks Marty and Sean. Yes, I took a look at the source code yesterday and realized as well that it is not thread safe.
I am working on a high-performance mapper that needs to be thread safe, so that I can exploit the data parallelism that comes from processing multiple input <key, val> pairs in a single mapper. I am currently researching whether there is an easy way to make the CIMapper implementation thread safe, perhaps by making a few key data structures thread safe, such as OpenIntDoubleHashMap. Hopefully this won't break the correctness of the algorithm itself.

Yunming

On Dec 20, 2012, at 9:07 AM, Marty Kube <[email protected]> wrote:

> Sean is right, most MR code is not and does not need to be thread safe.
>
> Why are you writing a multi-threaded mapper?
>
> On 12/19/2012 07:50 PM, Sean Owen wrote:
>> Hadoop will only use one thread with one Mapper or Reducer instance. Unless
>> you are somehow spawning threads on your own, concurrency should not be an
>> issue. I don't know if this behavior is guaranteed, but it seems to be how it
>> always works.
>> On Dec 19, 2012 4:03 PM, "Yunming Zhang" <[email protected]> wrote:
>>
>>> Hi,
>>>
>>> I am developing a custom mapper that is somewhat similar to the
>>> multithreaded mapper that comes with Hadoop, and I am getting weird errors
>>> when running with multiple threads processing multiple input key, value
>>> pairs simultaneously. Here is the stack trace. I looked into
>>> OpenIntDoubleHashMap, and the error seems to stem from null values stored in
>>> the tables.
>>>
>>> attempt_201212190955_0004_m_000000_0: java.lang.ArrayIndexOutOfBoundsException: 24
>>> attempt_201212190955_0004_m_000000_0: at org.apache.mahout.math.map.OpenIntDoubleHashMap.indexOfKey(OpenIntDoubleHashMap.java:278)
>>> attempt_201212190955_0004_m_000000_0: at org.apache.mahout.math.map.OpenIntDoubleHashMap.get(OpenIntDoubleHashMap.java:198)
>>> attempt_201212190955_0004_m_000000_0: at org.apache.mahout.math.RandomAccessSparseVector.getQuick(RandomAccessSparseVector.java:130)
>>> attempt_201212190955_0004_m_000000_0: at org.apache.mahout.math.AbstractVector.assign(AbstractVector.java:738)
>>> attempt_201212190955_0004_m_000000_0: at org.apache.mahout.clustering.AbstractCluster.observe(AbstractCluster.java:263)
>>> attempt_201212190955_0004_m_000000_0: at org.apache.mahout.clustering.AbstractCluster.observe(AbstractCluster.java:234)
>>> attempt_201212190955_0004_m_000000_0: at org.apache.mahout.clustering.AbstractCluster.observe(AbstractCluster.java:229)
>>> attempt_201212190955_0004_m_000000_0: at org.apache.mahout.clustering.AbstractCluster.observe(AbstractCluster.java:37)
>>> attempt_201212190955_0004_m_000000_0: at org.apache.mahout.clustering.classify.ClusterClassifier.train(ClusterClassifier.java:158)
>>> attempt_201212190955_0004_m_000000_0: at org.apache.mahout.clustering.iterator.CIMapper.map(CIMapper.java:46)
>>> attempt_201212190955_0004_m_000000_0: at org.apache.mahout.clustering.iterator.CIMapper.map(CIMapper.java:18)
>>>
>>> Does anyone know whether it is inherently thread safe to process
>>> multiple input key, val pairs in the mapper simultaneously?
>>>
>>> Thanks
>>>
>>> Yunming
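
P.S. To make the "guard the shared data structure" idea from my reply concrete, here is a minimal sketch of coarse-grained locking around a shared accumulator. It is only an illustration under assumptions: a plain java.util.HashMap stands in for OpenIntDoubleHashMap, and the class SynchronizedAccumulator with its observe, get and runDemo methods is hypothetical, not part of Mahout or Hadoop.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical illustration class, not Mahout code.
public class SynchronizedAccumulator {
    // Stand-in for a non-thread-safe sparse structure (assumption).
    private final Map<Integer, Double> sums = new HashMap<>();
    private final Object lock = new Object();

    // Every read-modify-write of the shared map happens under one lock.
    public void observe(int index, double value) {
        synchronized (lock) {
            sums.merge(index, value, Double::sum);
        }
    }

    public double get(int index) {
        synchronized (lock) {
            return sums.getOrDefault(index, 0.0);
        }
    }

    // Several worker threads update the same accumulator concurrently;
    // returns the sum over the 8 indices that were touched.
    public static double runDemo(int threads, int updatesPerThread) {
        SynchronizedAccumulator acc = new SynchronizedAccumulator();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int t = 0; t < threads; t++) {
            pool.submit(() -> {
                for (int i = 0; i < updatesPerThread; i++) {
                    acc.observe(i % 8, 1.0);
                }
            });
        }
        pool.shutdown();
        try {
            if (!pool.awaitTermination(1, TimeUnit.MINUTES)) {
                throw new IllegalStateException("workers did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        double total = 0.0;
        for (int k = 0; k < 8; k++) {
            total += acc.get(k);
        }
        return total;
    }

    public static void main(String[] args) {
        // With the lock in place, no updates are lost.
        System.out.println(runDemo(4, 10000));
    }
}
```

A single lock like this serializes every update, so it may give back much of the parallelism I am after. Keeping one accumulator per thread and merging them at the end would be another option, but I have not checked how well either approach fits the CIMapper code.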
