Hi,

Scanning 10 TB in Athena would cost about $50 (Athena charges $5 per TB of data scanned). If your data is in Parquet, it will cost even less, because the columnar format means only the columns you actually query are scanned. So I am genuinely not sure where your cost estimate is coming from. Also, what do you mean by "I currently need"? Are you already processing the data?
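For a rough comparison, here is the arithmetic behind the figures in this thread. The $5 per TB scanned is Athena's list price; the EMR surcharge, spot price and instance-hours are the numbers quoted in the reply below; the number of full-table scans is an illustrative assumption only:

    # Back-of-the-envelope comparison using the numbers in this thread.
    ATHENA_PER_TB_SCANNED = 5.00   # USD per TB scanned (Athena list price)
    DATA_TB = 10
    FULL_SCANS = 3                 # illustrative assumption, not from the thread
    athena_cost = ATHENA_PER_TB_SCANNED * DATA_TB * FULL_SCANS

    INSTANCE_HOURS = 4800          # r3.2xlarge hours quoted below
    EMR_FEE_PER_HOUR = 0.18        # USD, EMR surcharge per instance-hour
    SPOT_PER_HOUR = 0.12           # USD, approximate r3.2xlarge spot price
    emr_cost = INSTANCE_HOURS * (EMR_FEE_PER_HOUR + SPOT_PER_HOUR)

    print(f"Athena ({FULL_SCANS} full scans): ~${athena_cost:.0f}")   # ~$150
    print(f"EMR on spot:                      ~${emr_cost:.0f}")      # ~$1440
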
Since you mentioned that you are a student, may I please ask which University and College you are studying at? I may be able to provide some additional information regarding the same.

Regards,
Gourav Sengupta

On Mon, Nov 27, 2017 at 7:55 PM, Alexander Czech <alexander.cz...@googlemail.com> wrote:

> Hi
>
> Simply because you have to pay on top of every instance hour. I currently
> need about 4800 h of r3.2xlarge; EMR takes $0.18 per instance-hour, so it
> would be $864 just in EMR costs (spot prices are around $0.12/h).
>
> Just to stay on topic: I thought about getting 40 i2.xlarge instances,
> which have about 1 TB of combined RAM and 32 TB of combined SSD space.
> Would this be enough to load a 10 TB parquet, or do I need more RAM/disk
> spill space?
>
> On Mon, Nov 27, 2017 at 6:06 PM, Gourav Sengupta <gourav.sengu...@gmail.com> wrote:
>
>> Hi,
>>
>> I think that I have mentioned all the required alternatives. However, I am
>> quite curious as to how you concluded that processing using EMR is going
>> to be more expensive than using any other stack. I have been using EMR
>> for the last 6 years (almost since the time it came out), and have always
>> found it cheap, reliable, safe and stable (of course it's like fire: if you
>> are not careful it can end up burning you financially).
>>
>> Regards,
>> Gourav Sengupta
>>
>> On Mon, Nov 27, 2017 at 12:58 PM, Alexander Czech <alexander.cz...@googlemail.com> wrote:
>>
>>> I don't use EMR; I spin my clusters up using flintrock (being a student,
>>> my budget is slim). My code is written in PySpark and my data is in the
>>> us-east-1 region (N. Virginia). I will do my best explaining it with tables:
>>>
>>> My input, with a size of ~10 TB, sits in multiple (~150) parquet files on S3:
>>>
>>> +-----------+--------------------------+-------+------+-------+
>>> |        uri|                 link_list|lang_id|vector|content|
>>> +-----------+--------------------------+-------+------+-------+
>>> |www.123.com|[www.123.com,www.abc.com,]|   null|  null|   null|
>>> |www.abc.com|[www.opq.com,www.456.com,]|   null|  null|   null|
>>> |www.456.com|[www.xyz.com,www.abc.com,]|   null|  null|   null|
>>>
>>> *(link_list is an ArrayType(StringType()))*
>>>
>>> Step 1: I only load the uri and link_list columns (but they make up the
>>> bulk of the data).
>>> Then every uri is given a unique ID with
>>>
>>> df.withColumn('uri_id', func.monotonically_increasing_id())
>>>
>>> resulting in a dataframe looking like this:
>>>
>>> *DF_A*:
>>>
>>> +-----------+--------------------------+-------+
>>> |        uri|                 link_list| uri_id|
>>> +-----------+--------------------------+-------+
>>> |www.123.com|[www.123.com,www.abc.com,]|      1|
>>> |www.abc.com|[www.opq.com,www.456.com,]|      2|
>>> |www.456.com|[www.xyz.com,www.abc.com,]|      3|
>>>
>>> Step 2: I create another dataframe containing only the uri and uri_id
>>> fields, with uri_id renamed to link_id:
>>>
>>> *DF_B*:
>>>
>>> +-----------+--------+
>>> |        uri| link_id|
>>> +-----------+--------+
>>> |www.123.com|       1|
>>> |www.abc.com|       2|
>>> |www.456.com|       3|
>>>
>>> Step 3: Now I explode the link_list field in *DF_A* with
>>>
>>> *DF_A*.select("uri_id", func.explode("link_list").alias("link"))
>>>
>>> This gives me
>>>
>>> *DF_C*:
>>>
>>> +-----------+-------+
>>> |       link| uri_id|
>>> +-----------+-------+
>>> |www.123.com|      1|
>>> |www.abc.com|      1|
>>> |www.opq.com|      2|
>>> |www.456.com|      2|
>>> |www.xyz.com|      3|
>>> |www.abc.com|      3|
>>>
>>> Lastly I join *DF_C* with *DF_B*:
>>>
>>> *DF_C*.join(*DF_B*, *DF_C*.link == *DF_B*.uri, "left_outer").drop("uri")
>>>
>>> which results in the final dataframe:
>>>
>>> +-----------+-------+--------+
>>> |       link| uri_id| link_id|
>>> +-----------+-------+--------+
>>> |www.123.com|      1|       1|
>>> |www.abc.com|      1|       2|
>>> |www.opq.com|      2|    null|
>>> |www.456.com|      2|       3|
>>> |www.xyz.com|      3|    null|
>>> |www.abc.com|      3|       2|
>>>
>>> (In code the field link is also dropped, but it is hopefully more
>>> intelligible this way.)
>>>
>>> The rest is just to join the uri_id with the lang_id, vector, content rows
>>> that are not null, which is trivial.
>>>
>>> I hope this makes it more readable. If there is an AWS service that makes
>>> it easier for me to deal with the data (since it is basically "just"
>>> database operations) I'm also happy to hear about it.
>>> I've got a few days on my hands until the preprocessing is done, but I'm
>>> not sure if the explode in step 3 can be done in another AWS service.
>>>
>>> thanks!
>>>
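For reference, here is a minimal end-to-end PySpark sketch of the steps described above; the column names and operations come from the thread, while the SparkSession setup, the S3 paths and the output locations are illustrative placeholders:

    from pyspark.sql import SparkSession, functions as func

    spark = SparkSession.builder.appName("url-graph").getOrCreate()

    # Step 1: load only the two heavy columns from the ~10 TB parquet input
    # (placeholder path). Parquet's columnar layout means the lang_id/vector/
    # content columns are not read at all here.
    df = spark.read.parquet("s3a://my-bucket/input/").select("uri", "link_list")

    # Give every uri a unique ID. Note: monotonically_increasing_id() produces
    # unique but non-consecutive 64-bit IDs, unlike the 1, 2, 3 in the example.
    df_a = df.withColumn("uri_id", func.monotonically_increasing_id())

    # Step 2: uri -> ID lookup table, with the ID renamed to link_id
    df_b = df_a.select("uri", func.col("uri_id").alias("link_id"))

    # Step 3: explode link_list so every outgoing link becomes its own row
    df_c = df_a.select("uri_id", func.explode("link_list").alias("link"))

    # Resolve each link to its ID; links that never appear as a uri stay null
    edges = (
        df_c.join(df_b, df_c.link == df_b.uri, "left_outer")
            .select("uri_id", "link_id")
    )

    # Persist the two graph tables (placeholder output paths)
    df_b.write.parquet("s3a://my-bucket/output/url_to_id/")
    edges.write.parquet("s3a://my-bucket/output/id_to_id/")

The third table (uri_id plus lang_id, vector and content for the roughly 1% of rows where they are non-null) would then be one more column-pruned read of the same parquet, joined against df_b, as described elsewhere in the thread.
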
>>> On Mon, Nov 27, 2017 at 12:32 PM, Gourav Sengupta <gourav.sengu...@gmail.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> it would be much simpler in case you just provide two tables with
>>>> samples of the input and output. Going through the verbose text and trying to
>>>> read and figure out what is happening is a bit daunting.
>>>>
>>>> Personally, given that you have your entire data in Parquet, I do not
>>>> think that you will need a large cluster size at all. You can do it
>>>> with a small cluster as well, but depending on the cluster size, you
>>>> might want to create intermediate staging tables or persist the data.
>>>>
>>>> Also, it will help if you could kindly provide the EMR version
>>>> that you are using.
>>>>
>>>> On another note, also mention the AWS Region you are in. If Redshift
>>>> Spectrum is available, or you can use Athena, or you can use Presto, then
>>>> running massive aggregates over huge data sets at a fraction of the cost
>>>> and at least 10x the speed may be handy as well.
>>>>
>>>> Let me know in case you need any further help.
>>>>
>>>> Regards,
>>>> Gourav
>>>>
>>>> On Mon, Nov 27, 2017 at 11:05 AM, Alexander Czech <alexander.cz...@googlemail.com> wrote:
>>>>
>>>>> I have a temporary result file (the 10 TB one) that looks like this:
>>>>> I have around 3 billion rows of (url, url_list, language, vector, text).
>>>>> The bulk of the data is in url_list, and at the moment I can only guess
>>>>> how large url_list is. I want to give an ID to every url and then this ID
>>>>> to every url in url_list, to have an ID-to-ID graph. The columns language,
>>>>> vector and text only have values for 1% of all rows, so they only play a
>>>>> very minor role.
>>>>>
>>>>> The idea at the moment is to load the url and url_list columns from the
>>>>> parquet and give every row an ID. Then explode the url_list and join the
>>>>> IDs onto the now exploded rows. After that I drop the urls from the
>>>>> url_list column. For the rest of the computation I only load those rows
>>>>> from the parquet that have values in (language, vector and text) and join
>>>>> them with the ID table.
>>>>>
>>>>> In the end I will create 3 tables:
>>>>> 1. url, ID
>>>>> 2. ID, ID
>>>>> 3. ID, language, vector, text
>>>>>
>>>>> Basically there is one very big shuffle going on; the rest is not that
>>>>> heavy. The CPU-intensive lifting was done before that.
>>>>>
>>>>> On Mon, Nov 27, 2017 at 11:01 AM, Georg Heiler <georg.kf.hei...@gmail.com> wrote:
>>>>>
>>>>>> How many columns do you need from the big file?
>>>>>>
>>>>>> Also, how CPU / memory intensive are the computations you want to
>>>>>> perform?
>>>>>>
>>>>>> Alexander Czech <alexander.cz...@googlemail.com> wrote on Mon, 27
>>>>>> Nov 2017 at 10:57:
>>>>>>
>>>>>>> I want to load a 10 TB parquet file from S3 and I'm trying to decide
>>>>>>> what EC2 instances to use.
>>>>>>>
>>>>>>> Should I go for instances that in total have a larger memory size
>>>>>>> than 10 TB? Or is it enough that they have in total enough SSD storage
>>>>>>> so that everything can be spilled to disk?
>>>>>>>
>>>>>>> thanks
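On the original memory-versus-disk question: Spark does not need enough cluster RAM to hold the whole 10 TB, as long as shuffle data and cached partitions are allowed to spill to local disk. Below is a minimal sketch of the settings involved, assuming a flintrock-style standalone cluster; the S3 path, the SSD mount points and the partition count are illustrative assumptions, not values from the thread.

    from pyspark.sql import SparkSession
    from pyspark import StorageLevel

    spark = (
        SparkSession.builder
        .appName("url-graph-spill")
        # Shuffle spill and block-manager files go to spark.local.dir, so point
        # it at the instance-store SSDs (in standalone mode this can also be
        # set per worker via SPARK_LOCAL_DIRS). Mount points are assumptions.
        .config("spark.local.dir", "/media/ssd0/spark,/media/ssd1/spark")
        # Many smaller shuffle partitions keep each task's spill manageable
        # for a ~10 TB shuffle; the exact number needs tuning.
        .config("spark.sql.shuffle.partitions", 20000)
        .getOrCreate()
    )

    # Placeholder input path; only the two heavy columns are read.
    df = spark.read.parquet("s3a://my-bucket/input/").select("uri", "link_list")

    # MEMORY_AND_DISK keeps what fits in executor memory and spills the rest
    # to the local SSDs instead of failing when RAM runs out.
    df.persist(StorageLevel.MEMORY_AND_DISK)
    print(df.count())

With the 40 i2.xlarge workers mentioned earlier (~32 TB of SSD in total), spill capacity should not be the limiting factor; the practical concern is more likely to be shuffle time and partitioning.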