Re: Best way to process this dataset

2018-06-19 Thread Raymond Xie
Thank you, that works.


**
*Sincerely yours,*


*Raymond*

On Tue, Jun 19, 2018 at 4:36 PM, Nicolas Paris  wrote:

> Hi Raymond
>
> Spark works well on a single machine too, since it benefits from multiple
> cores.
> The CSV parser is based on univocity, and you might use the
> "spark.read.csv" syntax instead of the RDD API.
>
> From my experience, this will perform better than any other CSV parser.
>
> 2018-06-19 16:43 GMT+02:00 Raymond Xie :
>
>> Thank you Matteo, Aakash and Georg:
>>
>> I am attempting to get some stats first; the data looks like:
>>
>> 1,4152983,2355072,pv,1511871096
>>
>> I would like to find the count per key of (UserID, Behavior Type):
>>
>> val bh_count = sc.textFile("C:\\RXIE\\Learning\\Data\\Alibaba\\UserBehavior\\UserBehavior.csv")
>>   .map(_.split(","))
>>   .map(x => ((x(0).toInt, x(3)), 1))
>>   .groupByKey()
>>
>> This shows me:
>> scala> val first = bh_count.first
>> [Stage 1:>  (0 +
>> 1) / 1]2018-06-19 10:41:19 WARN  Executor:66 - Managed memory leak
>> detected; size = 15848112 bytes, TID = 110
>> first: ((Int, String), Iterable[Int]) = ((878310,pv),CompactBuffer(1, 1,
>> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
>> 1, 1))
>>
>>
>> *Note: this environment is Windows 7 with 32GB RAM. (I am running it on
>> Windows first, where I have more RAM, instead of Ubuntu, so the environment
>> differs from what I said in the original email.)*
>> *Dataset is 3.6GB*
>>
>> *Thank you very much.*
>> **
>> *Sincerely yours,*
>>
>>
>> *Raymond*
>>
>> On Tue, Jun 19, 2018 at 4:04 AM, Matteo Cossu  wrote:
>>
>>> Single machine? Any other framework will perform better than Spark
>>>
>>> On Tue, 19 Jun 2018 at 09:40, Aakash Basu 
>>> wrote:
>>>
 Georg, just asking, can Pandas handle such a big dataset? And what if that
 data is then passed into any of the sklearn modules?

 On Tue, Jun 19, 2018 at 10:35 AM, Georg Heiler <
 georg.kf.hei...@gmail.com> wrote:

> use pandas or dask
>
> If you do want to use Spark, store the dataset as Parquet/ORC, and
> then continue to perform analytical queries on that dataset.
>
> Raymond Xie  wrote on Tue, Jun 19, 2018 at 04:29:
>
>> I have a 3.6GB CSV dataset (4 columns, 100,150,807 rows); my
>> environment is a 20GB SSD hard disk and 2GB RAM.
>>
>> The dataset comes with
>> User ID: 987,994
>> Item ID: 4,162,024
>> Category ID: 9,439
>> Behavior type ('pv', 'buy', 'cart', 'fav')
>> Unix Timestamp: spans November 25 to December 3, 2017
>>
>> I would like to hear any suggestions from you on how I should process
>> the dataset with my current environment.
>>
>> Thank you.
>>
>> **
>> *Sincerely yours,*
>>
>>
>> *Raymond*
>>
>

>>
>


Re: Best way to process this dataset

2018-06-19 Thread Nicolas Paris
Hi Raymond

Spark works well on a single machine too, since it benefits from multiple
cores.
The CSV parser is based on univocity, and you might use the
"spark.read.csv" syntax instead of the RDD API.

From my experience, this will perform better than any other CSV parser.
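
For example, a rough, untested sketch of that route (the column names and
the local Windows path are assumptions, since the file has no header, and
"spark" is the spark-shell SparkSession):

// Read the headerless CSV with the DataFrame API instead of the RDD API.
val df = spark.read
  .option("header", "false")
  .option("inferSchema", "true")
  .csv("C:\\RXIE\\Learning\\Data\\Alibaba\\UserBehavior\\UserBehavior.csv")
  .toDF("user_id", "item_id", "category_id", "behavior", "ts")

// Count rows per (user id, behavior type) key.
val bhCount = df.groupBy("user_id", "behavior").count()
bhCount.show(10)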

2018-06-19 16:43 GMT+02:00 Raymond Xie :

> Thank you Matteo, Aakash and Georg:
>
> I am attempting to get some stats first; the data looks like:
>
> 1,4152983,2355072,pv,1511871096
>
> I would like to find the count per key of (UserID, Behavior Type):
>
> val bh_count = sc.textFile("C:\\RXIE\\Learning\\Data\\Alibaba\\UserBehavior\\UserBehavior.csv")
>   .map(_.split(","))
>   .map(x => ((x(0).toInt, x(3)), 1))
>   .groupByKey()
>
> This shows me:
> scala> val first = bh_count.first
> [Stage 1:>  (0 +
> 1) / 1]2018-06-19 10:41:19 WARN  Executor:66 - Managed memory leak
> detected; size = 15848112 bytes, TID = 110
> first: ((Int, String), Iterable[Int]) = ((878310,pv),CompactBuffer(1, 1,
> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
> 1, 1))
>
>
> *Note: this environment is Windows 7 with 32GB RAM. (I am running it on
> Windows first, where I have more RAM, instead of Ubuntu, so the environment
> differs from what I said in the original email.)*
> *Dataset is 3.6GB*
>
> *Thank you very much.*
> **
> *Sincerely yours,*
>
>
> *Raymond*
>
> On Tue, Jun 19, 2018 at 4:04 AM, Matteo Cossu  wrote:
>
>> Single machine? Any other framework will perform better than Spark
>>
>> On Tue, 19 Jun 2018 at 09:40, Aakash Basu 
>> wrote:
>>
>>> Georg, just asking, can Pandas handle such a big dataset? And what if
>>> that data is then passed into any of the sklearn modules?
>>>
>>> On Tue, Jun 19, 2018 at 10:35 AM, Georg Heiler <
>>> georg.kf.hei...@gmail.com> wrote:
>>>
 use pandas or dask

 If you do want to use Spark, store the dataset as Parquet/ORC, and
 then continue to perform analytical queries on that dataset.

 Raymond Xie  wrote on Tue, Jun 19, 2018 at 04:29:

> I have a 3.6GB CSV dataset (4 columns, 100,150,807 rows); my
> environment is a 20GB SSD hard disk and 2GB RAM.
>
> The dataset comes with
> User ID: 987,994
> Item ID: 4,162,024
> Category ID: 9,439
> Behavior type ('pv', 'buy', 'cart', 'fav')
> Unix Timestamp: spans November 25 to December 3, 2017
>
> I would like to hear any suggestions from you on how I should process
> the dataset with my current environment.
>
> Thank you.
>
> **
> *Sincerely yours,*
>
>
> *Raymond*
>

>>>
>


Re: Best way to process this dataset

2018-06-19 Thread Raymond Xie
Thank you Matteo, Aakash and Georg:

I am attempting to get some stats first; the data looks like:

1,4152983,2355072,pv,1511871096

I would like to find the count per key of (UserID, Behavior Type):

val bh_count = sc.textFile("C:\\RXIE\\Learning\\Data\\Alibaba\\UserBehavior\\UserBehavior.csv")
  .map(_.split(","))
  .map(x => ((x(0).toInt, x(3)), 1))
  .groupByKey()

This shows me:
scala> val first = bh_count.first
[Stage 1:>  (0 + 1)
/ 1]2018-06-19 10:41:19 WARN  Executor:66 - Managed memory leak detected;
size = 15848112 bytes, TID = 110
first: ((Int, String), Iterable[Int]) = ((878310,pv),CompactBuffer(1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1))
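
Note: groupByKey materializes a CompactBuffer of 1s for every key. A rough,
untested sketch of the same count using reduceByKey, which sums the 1s as it
goes and therefore needs far less memory (same file layout assumed):

// Same (UserID, Behavior Type) count, aggregated instead of buffered per key.
val bh_count2 = sc.textFile("C:\\RXIE\\Learning\\Data\\Alibaba\\UserBehavior\\UserBehavior.csv")
  .map(_.split(","))
  .map(x => ((x(0).toInt, x(3)), 1))
  .reduceByKey(_ + _)

bh_count2.take(5).foreach(println)

countByKey() on the keyed RDD would also do the job when the result is small
enough to collect to the driver.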


*Note: this environment is Windows 7 with 32GB RAM. (I am running it on
Windows first, where I have more RAM, instead of Ubuntu, so the environment
differs from what I said in the original email.)*
*Dataset is 3.6GB*

*Thank you very much.*
**
*Sincerely yours,*


*Raymond*

On Tue, Jun 19, 2018 at 4:04 AM, Matteo Cossu  wrote:

> Single machine? Any other framework will perform better than Spark
>
> On Tue, 19 Jun 2018 at 09:40, Aakash Basu 
> wrote:
>
>> Georg, just asking, can Pandas handle such a big dataset? And what if that
>> data is then passed into any of the sklearn modules?
>>
>> On Tue, Jun 19, 2018 at 10:35 AM, Georg Heiler  wrote:
>>
>>> use pandas or dask
>>>
>>> If you do want to use Spark, store the dataset as Parquet/ORC, and then
>>> continue to perform analytical queries on that dataset.
>>>
>>> Raymond Xie  wrote on Tue, Jun 19, 2018 at 04:29:
>>>
 I have a 3.6GB CSV dataset (4 columns, 100,150,807 rows); my
 environment is a 20GB SSD hard disk and 2GB RAM.

 The dataset comes with
 User ID: 987,994
 Item ID: 4,162,024
 Category ID: 9,439
 Behavior type ('pv', 'buy', 'cart', 'fav')
 Unix Timestamp: spans November 25 to December 3, 2017

 I would like to hear any suggestions from you on how I should process
 the dataset with my current environment.

 Thank you.

 **
 *Sincerely yours,*


 *Raymond*

>>>
>>


Re: Best way to process this dataset

2018-06-19 Thread Matteo Cossu
Single machine? Any other framework will perform better than Spark

On Tue, 19 Jun 2018 at 09:40, Aakash Basu 
wrote:

> Georg, just asking, can Pandas handle such a big dataset? And what if that
> data is then passed into any of the sklearn modules?
>
> On Tue, Jun 19, 2018 at 10:35 AM, Georg Heiler 
> wrote:
>
>> use pandas or dask
>>
>> If you do want to use Spark, store the dataset as Parquet/ORC, and then
>> continue to perform analytical queries on that dataset.
>>
>> Raymond Xie  wrote on Tue, Jun 19, 2018 at 04:29:
>>
>>> I have a 3.6GB CSV dataset (4 columns, 100,150,807 rows); my environment
>>> is a 20GB SSD hard disk and 2GB RAM.
>>>
>>> The dataset comes with
>>> User ID: 987,994
>>> Item ID: 4,162,024
>>> Category ID: 9,439
>>> Behavior type ('pv', 'buy', 'cart', 'fav')
>>> Unix Timestamp: spans November 25 to December 3, 2017
>>>
>>> I would like to hear any suggestions from you on how I should process the
>>> dataset with my current environment.
>>>
>>> Thank you.
>>>
>>> **
>>> *Sincerely yours,*
>>>
>>>
>>> *Raymond*
>>>
>>
>


Re: Best way to process this dataset

2018-06-19 Thread Aakash Basu
Georg, just asking, can Pandas handle such a big dataset? And what if that
data is then passed into any of the sklearn modules?

On Tue, Jun 19, 2018 at 10:35 AM, Georg Heiler 
wrote:

> use pandas or dask
>
> If you do want to use Spark, store the dataset as Parquet/ORC, and then
> continue to perform analytical queries on that dataset.
>
> Raymond Xie  wrote on Tue, Jun 19, 2018 at 04:29:
>
>> I have a 3.6GB CSV dataset (4 columns, 100,150,807 rows); my environment
>> is a 20GB SSD hard disk and 2GB RAM.
>>
>> The dataset comes with
>> User ID: 987,994
>> Item ID: 4,162,024
>> Category ID: 9,439
>> Behavior type ('pv', 'buy', 'cart', 'fav')
>> Unix Timestamp: spans November 25 to December 3, 2017
>>
>> I would like to hear any suggestions from you on how I should process the
>> dataset with my current environment.
>>
>> Thank you.
>>
>> **
>> *Sincerely yours,*
>>
>>
>> *Raymond*
>>
>


Re: Best way to process this dataset

2018-06-18 Thread Georg Heiler
use pandas or dask

If you do want to use Spark, store the dataset as Parquet/ORC, and then
continue to perform analytical queries on that dataset.
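
A rough, untested sketch of that conversion (the output path and column
names are assumptions, and "spark" is the spark-shell SparkSession):

// One-time conversion of the CSV to Parquet, then query the columnar copy.
val df = spark.read
  .option("inferSchema", "true")
  .csv("C:\\RXIE\\Learning\\Data\\Alibaba\\UserBehavior\\UserBehavior.csv")
  .toDF("user_id", "item_id", "category_id", "behavior", "ts")

df.write.mode("overwrite")
  .parquet("C:\\RXIE\\Learning\\Data\\Alibaba\\UserBehavior\\UserBehavior.parquet")

// Later queries read only the columns they need from the Parquet copy.
spark.read.parquet("C:\\RXIE\\Learning\\Data\\Alibaba\\UserBehavior\\UserBehavior.parquet")
  .groupBy("behavior").count().show()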

Raymond Xie  wrote on Tue, Jun 19, 2018 at 04:29:

> I have a 3.6GB CSV dataset (4 columns, 100,150,807 rows); my environment
> is a 20GB SSD hard disk and 2GB RAM.
>
> The dataset comes with
> User ID: 987,994
> Item ID: 4,162,024
> Category ID: 9,439
> Behavior type ('pv', 'buy', 'cart', 'fav')
> Unix Timestamp: spans November 25 to December 3, 2017
>
> I would like to hear any suggestions from you on how I should process the
> dataset with my current environment.
>
> Thank you.
>
> **
> *Sincerely yours,*
>
>
> *Raymond*
>


Best way to process this dataset

2018-06-18 Thread Raymond Xie
I have a 3.6GB CSV dataset (4 columns, 100,150,807 rows); my environment is
a 20GB SSD hard disk and 2GB RAM.

The dataset comes with
User ID: 987,994
Item ID: 4,162,024
Category ID: 9,439
Behavior type ('pv', 'buy', 'cart', 'fav')
Unix Timestamp: spans November 25 to December 3, 2017

I would like to hear any suggestions from you on how I should process the
dataset with my current environment.

Thank you.

**
*Sincerely yours,*


*Raymond*