Writing thread-safe code is hard. Don't do it unless you have to.

On Dec 20, 2012, at 4:28 AM, Sean Owen <[email protected]> wrote:
> ... but making the implementation thread-safe won't make it be used by
> multiple threads. If you want more parallelism, suggest to Hadoop to
> use more mappers by reducing the max input split size. But this is
> still not going to require your mappers to be thread-safe.
>
> If you mean you are making your own parallelism in miniature by
> writing a multi-threaded mapper, I wouldn't bother. Just use more
> parallelism via Hadoop.
>
> On Thu, Dec 20, 2012 at 3:31 AM, Yunming Zhang
> <[email protected]> wrote:
>> Thanks Marty, Sean,
>>
>> Yeah, I took a look at the source code yesterday and realized that it is not
>> thread safe as well.
>>
>> I am working on a high-performance mapper that requires making the mapper
>> thread safe so I could exploit the data parallelism that comes with
>> processing multiple input <key, val> pairs to a single mapper.
>>
>> I am currently researching whether there is any easy way that I could make
>> the CIMapper implementation thread safe by maybe making a few key data
>> structures thread safe, like the OpenIntDoubleHashMap, and
>> hopefully this won't screw up the correctness of the algorithm itself.
>>
>> Yunming
>>
>> On Dec 20, 2012, at 9:07 AM, Marty Kube
>> <[email protected]> wrote:
>>
>>> Sean is right, most MR code is not and does not need to be thread safe.
>>>
>>> Why are you writing a multi-threaded mapper?
>>>
>>> On 12/19/2012 07:50 PM, Sean Owen wrote:
>>>> Hadoop will only use one thread with one Mapper or Reducer instance. Unless
>>>> you are somehow spawning threads on your own, concurrency should not be an
>>>> issue. I don't know if this behavior is guaranteed, but it seems to be how it
>>>> always works.
>>>> On Dec 19, 2012 4:03 PM, "Yunming Zhang" <[email protected]>
>>>> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I am developing a custom mapper that is somewhat similar to the
>>>>> multithreaded mapper that came with Hadoop, and I am getting weird errors
>>>>> when running with multiple threads processing multiple input key, value
>>>>> pairs simultaneously. Here is the stack trace. I looked into
>>>>> OpenIntDoubleHashMap, and it seems to stem from null values stored in
>>>>> the tables:
>>>>>
>>>>> attempt_201212190955_0004_m_000000_0: java.lang.ArrayIndexOutOfBoundsException: 24
>>>>> attempt_201212190955_0004_m_000000_0: at org.apache.mahout.math.map.OpenIntDoubleHashMap.indexOfKey(OpenIntDoubleHashMap.java:278)
>>>>> attempt_201212190955_0004_m_000000_0: at org.apache.mahout.math.map.OpenIntDoubleHashMap.get(OpenIntDoubleHashMap.java:198)
>>>>> attempt_201212190955_0004_m_000000_0: at org.apache.mahout.math.RandomAccessSparseVector.getQuick(RandomAccessSparseVector.java:130)
>>>>> attempt_201212190955_0004_m_000000_0: at org.apache.mahout.math.AbstractVector.assign(AbstractVector.java:738)
>>>>> attempt_201212190955_0004_m_000000_0: at org.apache.mahout.clustering.AbstractCluster.observe(AbstractCluster.java:263)
>>>>> attempt_201212190955_0004_m_000000_0: at org.apache.mahout.clustering.AbstractCluster.observe(AbstractCluster.java:234)
>>>>> attempt_201212190955_0004_m_000000_0: at org.apache.mahout.clustering.AbstractCluster.observe(AbstractCluster.java:229)
>>>>> attempt_201212190955_0004_m_000000_0: at org.apache.mahout.clustering.AbstractCluster.observe(AbstractCluster.java:37)
>>>>> attempt_201212190955_0004_m_000000_0: at org.apache.mahout.clustering.classify.ClusterClassifier.train(ClusterClassifier.java:158)
>>>>> attempt_201212190955_0004_m_000000_0: at org.apache.mahout.clustering.iterator.CIMapper.map(CIMapper.java:46)
>>>>> attempt_201212190955_0004_m_000000_0: at org.apache.mahout.clustering.iterator.CIMapper.map(CIMapper.java:18)
>>>>>
>>>>> Does anyone know whether it is inherently thread safe to process
>>>>> multiple input key, val pairs in the mapper simultaneously?
>>>>>
>>>>> Thanks
>>>>>
>>>>> Yunming
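
For anyone who does end up spawning threads inside a mapper: the trace above is consistent with several threads hitting one non-thread-safe hash map at once, though that is an inference from the thread, not something confirmed here. A minimal sketch of one way around it, assuming the per-record work can be split into independent partial results, is to give each thread its own private map and merge the partial maps on a single thread afterwards. The class and method names below are hypothetical, and plain `java.util` collections stand in for Mahout's `OpenIntDoubleHashMap`; this is not Mahout or Hadoop code.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical illustration only: plain java.util collections stand in for
// the mapper's shared state. This is not Mahout or Hadoop code.
class PerThreadAccumulation {

    // Sums values[i] per keys[i]. Each worker fills its own private HashMap,
    // so no map is ever read or written by two threads at the same time.
    static Map<Integer, Double> sumByKey(int[] keys, double[] values, int numThreads) {
        ExecutorService pool = Executors.newFixedThreadPool(numThreads);
        try {
            List<Future<Map<Integer, Double>>> partials = new ArrayList<>();
            int chunk = (keys.length + numThreads - 1) / numThreads;
            for (int t = 0; t < numThreads; t++) {
                final int from = Math.min(keys.length, t * chunk);
                final int to = Math.min(keys.length, from + chunk);
                partials.add(pool.submit(() -> {
                    // Thread-confined map: only this worker touches it.
                    Map<Integer, Double> local = new HashMap<>();
                    for (int i = from; i < to; i++) {
                        local.merge(keys[i], values[i], Double::sum);
                    }
                    return local;
                }));
            }
            // Merge on the calling thread; get() blocks until that worker is done.
            Map<Integer, Double> total = new HashMap<>();
            for (Future<Map<Integer, Double>> partial : partials) {
                partial.get().forEach((k, v) -> total.merge(k, v, Double::sum));
            }
            return total;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("worker failed", e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        int[] keys = {1, 2, 1, 3, 2, 1};
        double[] values = {1.0, 2.0, 3.0, 4.0, 5.0, 6.0};
        System.out.println(sumByKey(keys, values, 3));
    }
}
```

This only works when the update is associative, as a sum is. Whether the cluster `observe` step in the trace can be split this way is a separate question about the algorithm, which is one more reason to prefer the advice above and let Hadoop supply the parallelism.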
