Re: SparkR DataFrame, out of memory exception for very small file.
Hi Jeff,

This is only part of the actual code; my questions are in the comments near the relevant lines.

SALES <- SparkR::sql(hiveContext, "select * from sales")
PRICING <- SparkR::sql(hiveContext, "select * from pricing")

## Renaming of columns ##

# Sales file #
# Is this right? Do we have to create a new DataFrame for every column
# added to the original DataFrame? And if we do that, what about the
# older DataFrames? Will they also take up memory?
names(SALES)[which(names(SALES) == "div_no")] <- "DIV_NO"
names(SALES)[which(names(SALES) == "store_no")] <- "STORE_NO"

# Pricing file #
names(PRICING)[which(names(PRICING) == "price_type_cd")] <- "PRICE_TYPE"
names(PRICING)[which(names(PRICING) == "price_amt")] <- "PRICE_AMT"

registerTempTable(SALES, "sales")
registerTempTable(PRICING, "pricing")

# Merging the sales and pricing files #
merg_sales_pricing <- SparkR::sql(hiveContext, "select .")
head(merg_sales_pricing)

Thanks,
Vipul

On 23 November 2015 at 14:52, Jeff Zhang <zjf...@gmail.com> wrote:
> If possible, could you share your code? What kind of operation are you
> doing on the dataframe?
>
> [...]

--
Regards,
Vipul Rai
www.vipulrai.me
+91-8892598819
<http://in.linkedin.com/in/vipulrai/>
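[For reference, a minimal sketch of the rename-and-join flow discussed in the message above, using SparkR's withColumnRenamed rather than names<-. It assumes SparkR 1.5 with a working hiveContext and the two Hive tables from the thread; the join key is only a placeholder, since the original query is elided. Treat it as illustrative, not tested.]

```r
# Sketch only: assumes SparkR 1.5, a running hiveContext, and the
# "sales" / "pricing" Hive tables from the thread.
library(SparkR)

sales   <- sql(hiveContext, "select * from sales")
pricing <- sql(hiveContext, "select * from pricing")

# withColumnRenamed returns a NEW DataFrame; the old binding is just a
# lazy query plan, so it does not by itself hold the table in memory.
sales   <- withColumnRenamed(sales,   "div_no",        "DIV_NO")
sales   <- withColumnRenamed(sales,   "store_no",      "STORE_NO")
pricing <- withColumnRenamed(pricing, "price_type_cd", "PRICE_TYPE")
pricing <- withColumnRenamed(pricing, "price_amt",     "PRICE_AMT")

registerTempTable(sales,   "sales_renamed")
registerTempTable(pricing, "pricing_renamed")

# Hypothetical join key (STORE_NO here is only an example):
merged <- sql(hiveContext,
  "select s.*, p.PRICE_AMT
     from sales_renamed s join pricing_renamed p
       on s.STORE_NO = p.STORE_NO")
head(merged)
```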
Re: SparkR DataFrame, out of memory exception for very small file.
>>> Do I need to create a new DataFrame for every update to the DataFrame,
>>> like the addition of a new column, or do I need to update the original
>>> sales DataFrame?

Yes. DataFrame is immutable, and every mutation of a DataFrame produces a new DataFrame.

On Mon, Nov 23, 2015 at 4:44 PM, Vipul Rai <vipulrai8...@gmail.com> wrote:
> Hello Rui,
>
> Sorry, what I meant was that adding a new column to the original
> dataframe gives a new DataFrame as the result.
>
> Please check this for more:
>
> https://spark.apache.org/docs/1.5.1/api/R/index.html
>
> Check for withColumn.
>
> Thanks,
> Vipul
>
> [...]

--
Best Regards

Jeff Zhang
Re: SparkR DataFrame, out of memory exception for very small file.
Hello Rui,

Sorry, what I meant was that adding a new column to the original dataframe gives a new DataFrame as the result.

Please check this for more:

https://spark.apache.org/docs/1.5.1/api/R/index.html

Check for withColumn.

Thanks,
Vipul

On 23 November 2015 at 12:42, Sun, Rui <rui@intel.com> wrote:
> Vipul,
>
> Not sure if I understand your question. DataFrame is immutable. You
> can't update a DataFrame.
>
> Could you paste some log info for the OOM error?
>
> [...]

--
Regards,
Vipul Rai
www.vipulrai.me
+91-8892598819
<http://in.linkedin.com/in/vipulrai/>
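[A sketch of what withColumn returns, for the discussion above. It assumes SparkR 1.5 with an existing sales DataFrame and sqlContext; illustrative only, not a tested example.]

```r
# Sketch: assumes SparkR 1.5 with an existing "sales" DataFrame.
library(SparkR)

# withColumn does not modify "sales"; it returns a new DataFrame whose
# plan is "sales plus one literal column". Reassigning the binding is
# fine: the previous plan becomes garbage once nothing references it.
sales1 <- withColumn(sales, "C1", lit(607))

# Equivalent via SQL, as in the original post:
registerTempTable(sales, "Sales")
sales1 <- sql(sqlContext, "select a.*, 607 as C1 from Sales as a")
```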
Re: SparkR DataFrame, out of memory exception for very small file.
Hi Jeff,

Thanks for the reply, but could you tell me why it is taking so much time, and what could be wrong? Also, when I remove the DataFrame from memory using rm(), the memory is not released even though the object is deleted.

Also, what about the R functions which are not supported in SparkR, like ddply?

And how do I access the nth row of a SparkR DataFrame?

Regards,
Vipul

On 23 November 2015 at 14:25, Jeff Zhang <zjf...@gmail.com> wrote:
> >>> Do I need to create a new DataFrame for every update to the DataFrame,
> >>> like the addition of a new column, or do I need to update the original
> >>> sales DataFrame?
>
> Yes, DataFrame is immutable, and every mutation of a DataFrame will
> produce a new DataFrame.
>
> [...]

--
Regards,
Vipul Rai
www.vipulrai.me
+91-8892598819
<http://in.linkedin.com/in/vipulrai/>
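[A hedged sketch of the three questions in the message above (rm(), a ddply substitute, and the nth row), assuming SparkR 1.5. Note that rm() only drops the R-side handle; cached data has to be released on the Spark side. Column names below are placeholders, not from the thread's actual schema.]

```r
# Sketch: assumes SparkR 1.5 and an existing cached "sales" DataFrame.
library(SparkR)

# 1. Releasing memory: rm() removes only the R-side reference. If the
#    DataFrame was cached, release the executors' memory explicitly.
unpersist(sales)
rm(sales)

# 2. ddply-style split-apply-combine: use groupBy + agg instead
#    ("STORE_NO" / "PRICE_AMT" are placeholder column names).
by_store <- agg(groupBy(sales, sales$STORE_NO),
                total = sum(sales$PRICE_AMT))

# 3. "nth row": a distributed DataFrame has no stable row order, so pick
#    an explicit ordering, take the first n rows locally, and index the
#    nth one from the resulting local data.frame.
first_n <- take(arrange(sales, sales$STORE_NO), 10)
nth_row <- first_n[10, ]
```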
Re: SparkR DataFrame, out of memory exception for very small file.
If possible, could you share your code? What kind of operation are you doing on the dataframe?

On Mon, Nov 23, 2015 at 5:10 PM, Vipul Rai <vipulrai8...@gmail.com> wrote:
> Hi Jeff,
>
> Thanks for the reply, but could you tell me why it is taking so much
> time, and what could be wrong? Also, when I remove the DataFrame from
> memory using rm(), the memory is not released even though the object
> is deleted.
>
> Also, what about the R functions which are not supported in SparkR,
> like ddply?
>
> And how do I access the nth row of a SparkR DataFrame?
>
> Regards,
> Vipul
>
> [...]

--
Best Regards

Jeff Zhang
RE: SparkR DataFrame, out of memory exception for very small file.
Vipul,

Not sure if I understand your question. DataFrame is immutable. You can't update a DataFrame.

Could you paste some log info for the OOM error?

-----Original Message-----
From: vipulrai [mailto:vipulrai8...@gmail.com]
Sent: Friday, November 20, 2015 12:11 PM
To: user@spark.apache.org
Subject: SparkR DataFrame, out of memory exception for very small file.

Hi Users,

I have a general doubt regarding DataFrames in SparkR.

I am trying to read a file from Hive, and it gets created as a DataFrame.

sqlContext <- sparkRHive.init(sc)

# DF
sales <- read.df(sqlContext, "hdfs://sample.csv", header = 'true',
                 source = "com.databricks.spark.csv", inferSchema = 'true')

registerTempTable(sales, "Sales")

Do I need to create a new DataFrame for every update to the DataFrame, like the addition of a new column, or do I need to update the original sales DataFrame?

sales1 <- SparkR::sql(sqlContext, "select a.*, 607 as C1 from Sales as a")

Please help me with this, as the original file is only 20MB but it throws an out of memory exception on a cluster with a 4GB master and two workers of 4GB each.

Also, what is the logic with DataFrames: do I need to register and drop a tempTable after every update?

Thanks,
Vipul

--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/SparkR-DataFrame-Out-of-memory-exception-for-very-small-file-tp25435.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
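[On the last question in the original post, a minimal sketch of the temp-table lifecycle, assuming SparkR 1.5 and a Hive-enabled context; the CSV path is the placeholder from the post. Registering once is enough, since sql() resolves the table by name; dropTempTable releases the name when it is no longer needed.]

```r
# Sketch: assumes SparkR 1.5, a Hive-enabled context, and the CSV path
# from the post (placeholder).
library(SparkR)

sqlContext <- sparkRHive.init(sc)

sales <- read.df(sqlContext, "hdfs://sample.csv",
                 source = "com.databricks.spark.csv",
                 header = "true", inferSchema = "true")

# Register once; every later sql() call can refer to "Sales" by name.
registerTempTable(sales, "Sales")

# Each query yields a new (lazy) DataFrame; no re-registration is
# needed unless the *new* result must itself be visible to SQL.
sales1 <- sql(sqlContext, "select a.*, 607 as C1 from Sales as a")
registerTempTable(sales1, "Sales1")   # only if sales1 must be queryable

# Drop the name (not the underlying data) when finished with it.
dropTempTable(sqlContext, "Sales1")
```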