Re: Best way to process this dataset
Thank you, that works.

Sincerely yours,
Raymond
Re: Best way to process this dataset
Hi Raymond,

Spark works well on a single machine too, since it benefits from multiple cores.
The CSV parser is based on univocity, and you might use the "spark.read.csv" syntax instead of the RDD API;
from my experience, this will be better than any other CSV parser.

Nicolas
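For reference, a minimal sketch of that spark.read.csv approach, assuming a spark-shell session (where `spark` is predefined). The column names and types below are assumptions, since the CSV has no header row; the file path is the one from Raymond's snippet.

// Declaring the schema up front avoids a second pass over the 3.6GB file for type inference.
import org.apache.spark.sql.types._

val schema = StructType(Seq(
  StructField("user_id", IntegerType),
  StructField("item_id", IntegerType),
  StructField("category_id", IntegerType),
  StructField("behavior_type", StringType),
  StructField("ts", LongType)
))

val df = spark.read
  .option("header", "false")
  .schema(schema)
  .csv("C:\\RXIE\\Learning\\Data\\Alibaba\\UserBehavior\\UserBehavior.csv")

// The (UserID, BehaviorType) count Raymond is after, as a DataFrame aggregation.
df.groupBy("user_id", "behavior_type").count().show(5)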
Re: Best way to process this dataset
Thank you Matteo, Aakash and Georg:

I am attempting to get some stats first. The data looks like:

1,4152983,2355072,pv,1511871096

I'd like to find out the count for each key of (UserID, BehaviorType):

val bh_count = sc.textFile("C:\\RXIE\\Learning\\Data\\Alibaba\\UserBehavior\\UserBehavior.csv")
  .map(_.split(","))
  .map(x => ((x(0).toInt, x(3)), 1))
  .groupByKey()

This shows me:

scala> val first = bh_count.first
[Stage 1:> (0 + 1) / 1]2018-06-19 10:41:19 WARN Executor:66 - Managed memory leak detected; size = 15848112 bytes, TID = 110
first: ((Int, String), Iterable[Int]) = ((878310,pv),CompactBuffer(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1))

Note: this environment is Windows 7 with 32GB RAM. (I am running it on Windows first, where I have more RAM, rather than Ubuntu, so the environment differs from what I said in the original email.)
The dataset is 3.6GB.

Thank you very much.

Sincerely yours,
Raymond
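A side note on the snippet above: groupByKey buffers every 1 per key (the CompactBuffer in the output) before anything is counted. A minimal sketch of the same count with reduceByKey, which sums as it goes, reusing the same path and field indices; the variable name is just illustrative:

// Same (UserID, BehaviorType) count; reduceByKey combines the 1s map-side
// instead of materializing a CompactBuffer of 1s for every key.
val bh_count2 = sc.textFile("C:\\RXIE\\Learning\\Data\\Alibaba\\UserBehavior\\UserBehavior.csv")
  .map(_.split(","))
  .map(x => ((x(0).toInt, x(3)), 1))
  .reduceByKey(_ + _)

bh_count2.take(5).foreach(println)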
Re: Best way to process this dataset
Single machine? Any other framework will perform better than Spark.
Re: Best way to process this dataset
Georg, just asking, can Pandas handle such a big dataset, if that data is then passed into any of the sklearn modules?
Re: Best way to process this dataset
Use pandas or dask.

If you do want to use Spark, store the dataset as Parquet/ORC, and then continue to perform analytical queries on that dataset.
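A minimal sketch of that Parquet conversion, assuming a spark-shell session and reusing the hypothetical column names from the read sketch above; the output directory name is made up:

// One-time conversion: read the raw CSV once and write it out as Parquet.
// Parquet is compressed and columnar, so later queries read far less than the 3.6GB of raw CSV.
val raw = spark.read
  .option("header", "false")
  .option("inferSchema", "true")
  .csv("C:\\RXIE\\Learning\\Data\\Alibaba\\UserBehavior\\UserBehavior.csv")
  .toDF("user_id", "item_id", "category_id", "behavior_type", "ts")

raw.write.mode("overwrite").parquet("C:\\RXIE\\Learning\\Data\\Alibaba\\UserBehavior\\user_behavior_parquet")

// Subsequent analytical queries run against the Parquet copy.
val behaviors = spark.read.parquet("C:\\RXIE\\Learning\\Data\\Alibaba\\UserBehavior\\user_behavior_parquet")
behaviors.groupBy("behavior_type").count().show()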
Best way to process this dataset
I have a 3.6GB CSV dataset (4 columns, 100,150,807 rows); my environment is a 20GB SSD hard disk and 2GB RAM.

The dataset comes with:
User ID: 987,994
Item ID: 4,162,024
Category ID: 9,439
Behavior type: ('pv', 'buy', 'cart', 'fav')
Unix Timestamp: spans from November 25 to December 3, 2017

I would like to hear any suggestions from you on how I should process the dataset with my current environment.

Thank you.

Sincerely yours,
Raymond