Hi,

Simply because you have to pay the EMR charge on top of every instance hour. I currently need about 4800 hours of r3.2xlarge; EMR costs $0.18 per instance-hour on top of that, so it would be $864 just in EMR costs (spot prices are around $0.12/h).
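(A quick sketch of that arithmetic, using the ballpark prices above rather than current AWS rates:)

instance_hours = 4800   # ~4800 h of r3.2xlarge
emr_surcharge = 0.18    # $ per instance-hour added by EMR
spot_price = 0.12       # $ per instance-hour for the spot instance itself

print(f"EMR surcharge: ${instance_hours * emr_surcharge:.0f}")  # ~$864
print(f"Spot cost:     ${instance_hours * spot_price:.0f}")     # ~$576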
Just to stay on topic: I thought about getting 40 i2.xlarge instances, which have about 1TB of combined RAM and 32TB of combined SSD space. Would this be enough to load a 10TB parquet, or do I need more RAM/disk spill space?

On Mon, Nov 27, 2017 at 6:06 PM, Gourav Sengupta <gourav.sengu...@gmail.com> wrote:

> Hi,
>
> I think that I have mentioned all the required alternatives. However, I am
> quite curious as to how you concluded that processing using EMR is going
> to be more expensive than using any other stack. I have been using EMR for
> the last 6 years (almost since the time it came out) and have always found
> it cheap, reliable, safe and stable (of course it is like fire: if you are
> not careful it can end up burning you financially).
>
> Regards,
> Gourav Sengupta
>
> On Mon, Nov 27, 2017 at 12:58 PM, Alexander Czech <alexander.cz...@googlemail.com> wrote:
>
>> I don't use EMR; I spin my clusters up using flintrock (being a student,
>> my budget is slim). My code is written in PySpark and my data is in the
>> us-east-1 region (N. Virginia). I will do my best to explain it with
>> tables.
>>
>> My input, with a size of (10TB), sits in multiple (~150) parquets on S3:
>>
>> +-----------+--------------------------+-------+------+-------+
>> |        uri|                 link_list|lang_id|vector|content|
>> +-----------+--------------------------+-------+------+-------+
>> |www.123.com|[www.123.com,www.abc.com,]|   null|  null|   null|
>> |www.abc.com|[www.opq.com,www.456.com,]|   null|  null|   null|
>> |www.456.com|[www.xyz.com,www.abc.com,]|   null|  null|   null|
>>
>> (link_list is an ArrayType(StringType()))
>>
>> Step 1: I only load the uri and link_list columns (but they make up the
>> bulk of the data). Then every uri is given a unique ID with
>> df.withColumn('uri_id', func.monotonically_increasing_id()),
>> resulting in a dataframe that looks like this:
>>
>> DF_A:
>>
>> +-----------+--------------------------+-------+
>> |        uri|                 link_list| uri_id|
>> +-----------+--------------------------+-------+
>> |www.123.com|[www.123.com,www.abc.com,]|      1|
>> |www.abc.com|[www.opq.com,www.456.com,]|      2|
>> |www.456.com|[www.xyz.com,www.abc.com,]|      3|
>>
>> Step 2: I create another dataframe containing only the uri and uri_id
>> fields, with uri_id renamed to link_id:
>>
>> DF_B:
>>
>> +-----------+--------+
>> |        uri| link_id|
>> +-----------+--------+
>> |www.123.com|       1|
>> |www.abc.com|       2|
>> |www.456.com|       3|
>>
>> Step 3: Now I explode the link_list field in DF_A with
>> DF_A.select("uri_id", func.explode("link_list").alias("link")).
>> This gives me:
>>
>> DF_C:
>>
>> +-----------+-------+
>> |       link| uri_id|
>> +-----------+-------+
>> |www.123.com|      1|
>> |www.abc.com|      1|
>> |www.opq.com|      2|
>> |www.456.com|      2|
>> |www.xyz.com|      3|
>> |www.abc.com|      3|
>>
>> Lastly, I join DF_C with DF_B using
>> DF_C.join(DF_B, DF_C.link == DF_B.uri, "left_outer").drop("uri"),
>> which results in the final dataframe:
>>
>> +-----------+-------+--------+
>> |       link| uri_id| link_id|
>> +-----------+-------+--------+
>> |www.123.com|      1|       1|
>> |www.abc.com|      1|       2|
>> |www.opq.com|      2|    null|
>> |www.456.com|      2|       3|
>> |www.xyz.com|      3|    null|
>> |www.abc.com|      3|       2|
>>
>> (In the code the link field is also dropped, but hopefully it is more
>> intelligible this way.)
>>
>> The rest is just joining the uri_id with the lang_id, vector and content
>> values of the rows where they are not null, which is trivial.
>>
>> I hope this makes it more readable.
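(For reference, here is the walkthrough above pulled together into a minimal runnable PySpark sketch. The toy data, the SparkSession setup and the commented s3a path are my own stand-ins, not the original code; on the real data, df would come from reading the ~150 parquet files on S3.)

from pyspark.sql import SparkSession
from pyspark.sql import functions as func

spark = SparkSession.builder.appName("uri-to-id-graph").getOrCreate()

# Stand-in for something like spark.read.parquet("s3a://<bucket>/input/")
# restricted to the two heavy columns.
df = spark.createDataFrame(
    [("www.123.com", ["www.123.com", "www.abc.com"]),
     ("www.abc.com", ["www.opq.com", "www.456.com"]),
     ("www.456.com", ["www.xyz.com", "www.abc.com"])],
    ["uri", "link_list"])

# Step 1: give every uri a unique (not necessarily consecutive) ID.
df_a = df.withColumn("uri_id", func.monotonically_increasing_id())

# Step 2: uri -> ID lookup table, with the ID renamed to link_id.
df_b = df_a.select("uri", func.col("uri_id").alias("link_id"))

# Step 3: one row per outgoing link.
df_c = df_a.select("uri_id", func.explode("link_list").alias("link"))

# Resolve every link to its ID; links pointing outside the input stay null.
edges = df_c.join(df_b, df_c.link == df_b.uri, "left_outer").drop("uri")
edges.show()

(On the real data, that last join is the one big shuffle mentioned further down the thread.)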
>> If there is an AWS service that makes it easier for me to deal with the
>> data, I'm also happy to hear about it, since it is basically "just"
>> database operations. I have a few days on my hands until the preprocessing
>> is done, but I'm not sure if the explode in step 3 can be done in another
>> AWS service.
>>
>> thanks!
>>
>> On Mon, Nov 27, 2017 at 12:32 PM, Gourav Sengupta <gourav.sengu...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> it would be much simpler if you just provided two tables with samples of
>>> the input and output. Going through the verbose text and trying to read
>>> and figure out what is happening is a bit daunting.
>>>
>>> Personally, given that you have your entire data in Parquet, I do not
>>> think that you will need a large cluster at all. You can do it with a
>>> small cluster as well, but depending on the cluster size you might want
>>> to create intermediate staging tables or persist the data.
>>>
>>> It will also help if you could kindly provide the EMR version that you
>>> are using.
>>>
>>> On another note, also mention the AWS Region you are in. If Redshift
>>> Spectrum is available, or you can use Athena or Presto, then running
>>> massive aggregates over huge data sets at a fraction of the cost and at
>>> least 10x the speed may be handy as well.
>>>
>>> Let me know in case you need any further help.
>>>
>>> Regards,
>>> Gourav
>>>
>>> On Mon, Nov 27, 2017 at 11:05 AM, Alexander Czech <alexander.cz...@googlemail.com> wrote:
>>>
>>>> I have a temporary result file (the 10TB one) that looks like this:
>>>> around 3 billion rows of (url, url_list, language, vector, text). The
>>>> bulk of the data is in url_list, and at the moment I can only guess how
>>>> large url_list is. I want to give an ID to every url and then this ID
>>>> to every url in url_list, to get an ID-to-ID graph. The columns
>>>> language, vector and text only have values for 1% of all rows, so they
>>>> only play a very minor role.
>>>>
>>>> The idea at the moment is to load the url and url_list columns from the
>>>> parquet and give every row an ID. Then I explode the url_list and join
>>>> the IDs onto the now exploded rows. After that I drop the URLs from the
>>>> url_list column. For the rest of the computation I only load those rows
>>>> from the parquet that have values in (language, vector and text) and
>>>> join them with the ID table.
>>>>
>>>> In the end I will create 3 tables:
>>>> 1. url, ID
>>>> 2. ID, ID
>>>> 3. ID, language, vector, text
>>>>
>>>> Basically there is one very big shuffle going on; the rest is not that
>>>> heavy. The CPU-intensive lifting was done before that.
>>>>
>>>> On Mon, Nov 27, 2017 at 11:01 AM, Georg Heiler <georg.kf.hei...@gmail.com> wrote:
>>>>
>>>>> How many columns do you need from the big file?
>>>>>
>>>>> Also, how CPU/memory intensive are the computations you want to
>>>>> perform?
>>>>>
>>>>> Alexander Czech <alexander.cz...@googlemail.com> wrote on Mon, Nov 27,
>>>>> 2017 at 10:57:
>>>>>
>>>>>> I want to load a 10TB parquet file from S3 and I'm trying to decide
>>>>>> what EC2 instances to use.
>>>>>>
>>>>>> Should I go for instances that in total have a larger memory size
>>>>>> than 10TB? Or is it enough that they have in total enough SSD storage
>>>>>> so that everything can be spilled to disk?
>>>>>>
>>>>>> thanks
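(On the original memory-vs-spill question, a rough PySpark sketch of the disk-only route, where intermediate data sits on the instances' local SSDs rather than in RAM. The bucket path, partition count and session settings are illustrative assumptions, not values from this thread.)

from pyspark import StorageLevel
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("spill-to-disk-sketch")
         # more, smaller shuffle partitions keep individual spill files manageable
         .config("spark.sql.shuffle.partitions", "20000")
         .getOrCreate())

# Hypothetical input path; only the two heavy columns are read.
df = spark.read.parquet("s3a://<bucket>/input/").select("uri", "link_list")

# Keep the intermediate result on local SSD instead of holding it in memory.
df.persist(StorageLevel.DISK_ONLY)
print(df.count())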