I see... do you have to do a full cross product or are you able to do a join?
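For instance, if the lookup keys (source_id, campaign_id, offer_id in the
example quoted below) also exist as columns in the big file, the whole chain
can stay as keyed replicated joins and skip the cross product entirely. This
is only a rough sketch, since the schemas aren't shown here and the column
names are assumed:

-- Assumed schemas; the real column names and types are not shown here.
small1 = LOAD 'small1' AS (source_id:int, source_name:chararray);
small2 = LOAD 'small2' AS (campaign_id:int, campaign_name:chararray);
small3 = LOAD 'small3' AS (offer_id:int, offer_name:chararray);
big    = LOAD 'big'    AS (source_id:int, campaign_id:int, offer_id:int);

j1 = JOIN big BY source_id, small1 BY source_id USING 'replicated';
j2 = JOIN j1 BY campaign_id, small2 BY campaign_id USING 'replicated';
j3 = JOIN j2 BY offer_id, small3 BY offer_id USING 'replicated';
dump j3;

That way each lookup row only decorates the big rows it actually matches,
instead of being attached to every row.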
On Tue, Nov 5, 2013 at 11:07 AM, burakkk <burak.isi...@gmail.com> wrote:

> There are several different small lookup files, so I need to process each
> lookup file individually. Based on your example it could work like this:
>
> a = LOAD 'small1'; -- for example, take source_id=1 --> then find source_name
> d = LOAD 'small2'; -- for example, take campaign_id=2 --> then find campaign_name
> e = LOAD 'small3'; -- for example, take offer_id=3 --> then find offer_name
> B = LOAD 'big';
> c = JOIN B BY 1, a BY 1 USING 'replicated';
> f = JOIN c BY 1, d BY 1 USING 'replicated';
> g = JOIN f BY 1, e BY 1 USING 'replicated';
> dump g;
>
> small1, small2, and small3 are different files, so they store different
> rows. At the end of the process I need to attach them to all rows in my
> big file. I know HDFS doesn't perform well with small files, but the data
> is originally stored in a different environment; I pull it from there and
> load it into HDFS. Anyway, because of our architecture I can't change
> that right now.
>
> Thanks
> Best regards...
>
>
> On Tue, Nov 5, 2013 at 7:43 PM, Pradeep Gollakota <pradeep...@gmail.com> wrote:
>
> > CROSS is grossly expensive to compute, so I'm not surprised that the
> > performance isn't good enough. Are you repeating your LOAD and FILTER
> > ops for every one of your small files? At the end of the day, what is
> > it that you're trying to accomplish? Find the one row you're after and
> > attach it to all rows in your big file?
> >
> > In terms of using DistributedCache: if you're computing the cross
> > product of two (and no more than two) relations, AND one of the
> > relations is small enough to fit in memory, you can use a replicated
> > JOIN instead, which would be much more performant.
> >
> > A = LOAD 'small';
> > B = LOAD 'big';
> > C = JOIN B BY 1, A BY 1 USING 'replicated';
> > dump C;
> >
> > Note that the smaller relation that will be loaded into memory needs
> > to be specified second in the JOIN statement.
> >
> > Also keep in mind that HDFS doesn't perform well with lots of small
> > files. If your design has (lots of) small files, you might benefit
> > from loading that data into some database (e.g. HBase).
> >
> >
> > On Tue, Nov 5, 2013 at 7:29 AM, burakkk <burak.isi...@gmail.com> wrote:
> >
> > > Hi,
> > > I'm using Pig 0.8.1-cdh3u5. Is there any method to use the
> > > distributed cache inside Pig?
> > >
> > > My problem is this: I have lots of small files in HDFS, let's say 10
> > > files. Each file contains more than one row, but I need only one row
> > > from each, and there isn't any relationship between them. So I
> > > filter each one for the row I need and then join them without any
> > > relationship (cross join). This is my workaround solution:
> > >
> > > a = LOAD 'smallFile1';  -- ex: row count: 1000
> > > b = FILTER a BY myrow == 'filter by exp1';
> > > c = LOAD 'smallFile2';  -- ex: row count: 30000
> > > d = FILTER c BY myrow2 == 'filter by exp2';
> > > e = CROSS b, d;
> > > ...
> > > f = LOAD 'bigFile';     -- ex: row count: 50 million
> > > g = CROSS e, f;
> > >
> > > But its performance isn't good enough. If I could use the
> > > distributed cache inside the Pig script, I could keep the files that
> > > I first read and filter as in-memory lookups. What is your
> > > suggestion? Is there any other performance-efficient way to do it?
> > >
> > > Thanks
> > > Best regards...
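Putting the two mails together, a minimal sketch of the pattern being
discussed: FILTER each small file down to the single row needed (as in the
first mail), then attach that row with a replicated JOIN instead of CROSS,
using the same BY 1 form as the snippets quoted above. Paths, column names,
and filter expressions are just the placeholders used above:

a   = LOAD 'smallFile1';
b   = FILTER a BY myrow == 'filter by exp1';   -- keeps only the one row needed
c   = LOAD 'smallFile2';
d   = FILTER c BY myrow2 == 'filter by exp2';  -- keeps only the one row needed
big = LOAD 'bigFile';
-- Each filtered lookup is a single row, so it easily fits in memory;
-- listing it second puts it on the replicated (in-memory) side of the join.
g = JOIN big BY 1, b BY 1 USING 'replicated';
h = JOIN g BY 1, d BY 1 USING 'replicated';
dump h;

The BY 1 here is taken straight from the quoted snippets: a constant key so
that the single lookup row matches every row of the big relation.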