Hi Harsh,
Thanks for moving the post to the correct list.

William

On Wed, Feb 13, 2013 at 12:29 AM, Harsh J <[email protected]> wrote:
> Please do not use the general@ lists for any user-oriented questions.
> Please redirect them to [email protected] lists, which is where
> the user community and questions lie.
>
> I've moved your post there and have added you on CC in case you
> haven't subscribed there. Please reply back only to the user@
> addresses. The general@ list is for Apache Hadoop project-level
> management and release-oriented discussions alone.
>
> On Wed, Feb 13, 2013 at 10:54 AM, William Kang <[email protected]> wrote:
>> Hi All,
>> I am trying to figure out a good solution for the following scenario.
>>
>> 1. I have a 2T file (let's call it A) filled with key/value pairs,
>> stored in HDFS with the default 64M block size. In A, each key is
>> less than 1K and each value is about 20M.
>>
>> 2. Occasionally, I run an analysis using a different type of data
>> (usually less than 10G; let's call it B) and perform lookup-table-like
>> operations using the values in A. B resides in HDFS as well.
>>
>> 3. This analysis requires loading only a small number of values from
>> A (usually fewer than 1000) into memory for fast lookups against the
>> data in B. B finds the few values it needs by looking up their keys
>> in A.
>>
>> Is there an efficient way to do this?
>>
>> I was thinking that if I could identify the locality of the blocks
>> that contain the few values, I might be able to push B to the few
>> nodes that hold those values of A. Since I only need to do this
>> occasionally, maintaining a distributed database such as HBase can't
>> be justified.
>>
>> Many thanks.
>>
>>
>> Cao
>
>
>
> --
> Harsh J
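
For reference, a minimal sketch of the block-locality idea in the question
above, using the standard FileSystem#getFileBlockLocations call. It assumes
the byte offset of each needed value inside A is already known, e.g. from a
side index mapping key to offset (not shown); the path /data/A and the
offsets are hypothetical:

  import java.util.Arrays;

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.BlockLocation;
  import org.apache.hadoop.fs.FileStatus;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  public class ValueLocality {
    public static void main(String[] args) throws Exception {
      Configuration conf = new Configuration();
      FileSystem fs = FileSystem.get(conf);

      // Hypothetical path to the 2T key/value file A.
      Path a = new Path("/data/A");
      FileStatus status = fs.getFileStatus(a);

      // Byte offset of one ~20M value inside A; in practice this would
      // come from a key -> offset side index built once over A.
      long offset = 128L * 1024 * 1024; // hypothetical offset
      long length = 20L * 1024 * 1024;  // roughly one value, per the sizes above

      // Ask the NameNode which blocks cover that byte range and which
      // datanodes hold replicas of them.
      BlockLocation[] locations = fs.getFileBlockLocations(status, offset, length);
      for (BlockLocation loc : locations) {
        System.out.println("block offset=" + loc.getOffset()
            + " length=" + loc.getLength()
            + " hosts=" + Arrays.toString(loc.getHosts()));
      }
    }
  }

With the host lists in hand, one way to "push B to those nodes" would be a
custom InputFormat whose getSplits() reports those hosts as the preferred
locations for its splits, which is the same mechanism MapReduce uses for
ordinary data locality. That part is only an idea, not shown here.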
