I have a temporary result file (the 10TB one) that looks like this:
around 3 billion rows of (url, url_list, language, vector, text). The
bulk of the data is in url_list, and at the moment I can only guess how
large url_list is. I want to give an ID to every url and then assign
this ID to every url in url_list, to get an ID-to-ID graph. The columns
language, vector and text only have values for 1% of all rows, so they
only play a very minor role.
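
To make that concrete, here is a minimal PySpark sketch of the ID
assignment. The column names (url, url_list) and the S3 path are
assumptions taken from the description above, not the real ones:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("url-id-graph").getOrCreate()

# Hypothetical input path; parquet is columnar, so selecting only these
# two columns avoids reading language/vector/text at all.
df = (spark.read.parquet("s3a://my-bucket/temp-result/")
           .select("url", "url_list"))

# One ID per row/url (assuming url is unique per row).
# monotonically_increasing_id() gives unique but non-contiguous 64-bit
# IDs without a shuffle; zipWithIndex on the RDD would give contiguous
# IDs at the cost of an extra pass.
url_ids = (df.select("url")
             .withColumn("id", F.monotonically_increasing_id())
             # Persist so the IDs are not recomputed (and potentially
             # reassigned) every time this table is reused in a join.
             .persist())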

The idea at the moment is to load the url and url_list columns from the
parquet and give every row an ID. Then I explode url_list and join the
IDs onto the now exploded rows. After that I drop the URL strings from
the exploded url_list, so only ID pairs remain. For the rest of the
computation I only load those rows from the parquet that have values in
(language, vector, text) and join them with the ID table.
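
Continuing the sketch above (same spark, df and url_ids, same assumed
column names), the explode-and-join step that produces the ID-to-ID
edges could look roughly like this:

# Explode url_list so every outgoing link becomes its own row.
edges = (
    df.select("url", F.explode("url_list").alias("dst_url"))
      # Attach the ID of the source url.
      .join(url_ids, on="url", how="inner")
      .withColumnRenamed("id", "src_id")
      # Attach the ID of the destination url; these two joins are where
      # the big shuffle happens. An inner join silently drops links to
      # urls that are not in the url table.
      .join(url_ids.withColumnRenamed("url", "dst_url")
                   .withColumnRenamed("id", "dst_id"),
            on="dst_url", how="inner")
      # Drop the URL strings and keep only the ID-to-ID pairs.
      .select("src_id", "dst_id")
)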

In the end I will create 3 tables:
1. url, ID
2. ID, ID
3. ID, language, vector, text
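
As a rough sketch of how those three tables fall out (continuing from
above; the output paths are made up):

# Only ~1% of rows have metadata, so filter before joining.
meta = (
    spark.read.parquet("s3a://my-bucket/temp-result/")
         .select("url", "language", "vector", "text")
         .where(F.col("language").isNotNull()
                | F.col("vector").isNotNull()
                | F.col("text").isNotNull())
         .join(url_ids, on="url", how="inner")
         .select("id", "language", "vector", "text")
)

url_ids.write.parquet("s3a://my-bucket/out/url_id/")   # 1. url, ID
edges.write.parquet("s3a://my-bucket/out/id_id/")      # 2. ID, ID
meta.write.parquet("s3a://my-bucket/out/id_meta/")     # 3. ID, metadata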

Basically there is one very big shuffle going on; the rest is not that
heavy. The CPU-intensive lifting was done before that.
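
The main knob for a shuffle like that is spark.sql.shuffle.partitions;
a rough, illustrative setting (the value is a placeholder, not a
recommendation):

# Illustrative only; the right value depends on cluster size and skew.
# More, smaller shuffle partitions keep each task's sort buffers small
# enough to spill to disk instead of blowing up executor memory.
# Shuffle and spill files land in spark.local.dir, so that should point
# at the instance SSDs.
spark.conf.set("spark.sql.shuffle.partitions", "20000")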

> On Mon, Nov 27, 2017 at 11:01 AM, Georg Heiler <georg.kf.hei...@gmail.com>
> wrote:
>
>> How many columns do you need from the big file?
>>
>> Also, how CPU/memory intensive are the computations you want to perform?
>>
>> On Mon, Nov 27, 2017 at 10:57 AM, Alexander Czech
>> <alexander.cz...@googlemail.com> wrote:
>>
>>> I want to load a 10TB parquet file from S3 and I'm trying to decide
>>> which EC2 instances to use.
>>>
>>> Should I go for instances that in total have more than 10TB of memory?
>>> Or is it enough that they have enough SSD storage in total so that
>>> everything can be spilled to disk?
>>>
>>> thanks
>>>
>>
>
