Re: [R] merge: right set overwrite left set
Thank you Ista, Your solution is smart, by sub-setting from x.HHu.map data only HHid, position as indices (because they are unique) for the merge, and any extra columns in x.HHu.map that are not present in y.HHo,map, thus when the merge is done with option all=T, will work among the two sets of data, without creating .x and .y when the variables are in common in two sets. With best wishes, Aldi ## find indv columns in x.HHu.map that don't exist in y.HHo.map x.HHu.map - x.HHu.map[ + c(HHid, + position, + names(x.HHu.map)[ +!names(x.HHu.map) +%in% names(y.HHo.map)] + )] ## merge, adding extra column from x.HHu.map zzz - merge(y.HHo.map, x.HHu.map, by=c('HHid', 'position'), all=T) ## order by HHid zzz - zzz[order(zzz$HHid),] zzz HHid position indv1 indv2 ind3 1 HH1 10 2 00 2 HH10 101NA 20 3 HH11 111NANA NA 4 HH2 20 0 20 5 HH3 30 0 10 6 HH4 42NANA0 7 HH5 55 2 20 8 HH6 66NANA NA 9 HH7 75NANA NA 10 HH8 81NANA NA 11 HH9 92NANA NA On 7/12/2015 12:06 PM, Ista Zahn wrote: I think this does what you want: ## find idiv coloumns in x.HHu.map that don't exist in y.HHo.map x.HHu.map - x.HHu.map[ c(HHid, position, names(x.HHu.map)[ !names(x.HHu.map) %in% names(y.HHo.map)] )] ## merge, adding extra column from x.HHu.map zzz - merge(y.HHo.map, x.HHu.map, by=c('HHid', 'position'), all=T) ## order by HHid zzz - zzz[order(zzz$HHid),] Best, Ista On Sun, Jul 12, 2015 at 10:45 AM, aldi a...@dsgmail.wustl.edu wrote: Hi, I have two sets of data x.HHu and y.HHo, rows are IDs and columns are individuals. I do not know in advance indv or HHid, both of them will be captured from the data. As the y.HHo set updates, y.HHo set has better information then x.HHu set. Thus I want a merge where right set overwrites left set info based on HHid, i.e. to overwrite x.HHu set with y.HHo set but keep any extra info from the x.HHu set that is not present in y.HHo set. HHids will be complete based on z.map, with the corresponding positions. I am having trouble with the part after this line: ### +++ I am thinking that I am creating new columns position indv1 and indv2, but R is interpreting them as row information. See the expected final table at the end. HHid is common, indv3 is from x.HHu, and the rest position and indv1 and indv2 are from y.HHo Any suggestions are appreciated. Thank you in advance, Aldi x.HHu- data.frame( HHid = c( 'HH1', 'HH2', 'HH3', 'HH4', 'HH5', 'HH10') , indv1 = c( 2, 0, 2 , 0, 2, 0) , indv2 = c( 0, NA, 2, 2, 2, 2) , ind3 = c( 0, 0, 0, 0, 0, 0) ) ### the HHo data will be the top set to overwrite any HHu data, when they exist, thinking that HHo are better than HHu results ### when they are available y.HHo-data.frame(HHid=c('HH1', 'HH2','HH5', 'HH3', 'HH10') , indv1 = c(2, 0, 2, 0, NA) , indv2 = c(0, 2, 2, 1, 2) ) z.map-data.frame(HHid = c('HH1', 'HH2', 'HH3', 'HH4', 'HH5', 'HH6','HH8', 'HH7', 'HH9', 'HH10', 'HH11') , position= c(10,20,30,42,55,66,81,75,92,101,111) ) ### see objects x.HHu y.HHo z.map ### now sort the map by position, this sorted map will be used to sort finally all data z.map-z.map[with(z.map, order(position)), ] z.map ### First I introduce position to both sets so I can sort them in advance by position. x.HHu.map -merge( z.map, x.HHu, by='HHid', all=T) x.HHu.map-x.HHu.map[with(x.HHu.map, order(position)), ] x.HHu.map y.HHo.map -merge( z.map, y.HHo, by='HHid', all= T) y.HHo.map-y.HHo.map[with(y.HHo.map, order(position)), ] y.HHo.map ### now merge HHu and HHo with the hope to overwrite the HHu set with HHo wherever they overlap by column names. zzz - merge(x.HHu.map, y.HHo.map, by='HHid', all=T) zzz ### find common variable names in two sets commonNames - names(x.impu.map)[which(colnames(x.impu.map) %in% colnames(y.geno.map))] ## remove HHid wich is common for x and y, but work with the rest of columns commonNames-commonNames[-c(1)] ### +++ for(i in 1:length(commonNames)){ print(commonNames[i]) zzz$commonNames[i] - NA print(paste(zzz,$,commonNames[i],.y,sep=)) zzz$commonNames[i] - zzz[,paste(commonNames[i],.y,sep=)] ### paste(zzz$commonNames[i],.x,sep='') - NULL; ### paste(zzz$commonNames[i],.y,sep='') - NULL; } zzz The final expected set has to be: HHid is common, indv3 is from x.HHu, and the rest position and indv1 and indv2 are from y.HHo HHid position ind3 indv1 indv2 1 HH1 10
Re: [R] merge: right set overwrite left set
Thank you Jeff, Your solutions have two great aspects: a) you provide a different approach by using reshape2 syntax / tidyr, and b) the concern that it is better to update x.HHu.map with y.HHo.map, without overwriting x.HHu.map with NA from y.HHo.map, thus keeping intact the old value(s). That is the ideal situation, but I do not know exactly if the old value is correct, which if it is not correct then it may create strata in the data. Therefore I prefer better to overwrite any old column with the new column when it exits, even with NA. In addition, you introduce the safety stringAsFactors=F. The map is only for getting position based on HHid, and at the end I plan sorting final set based on position. While the reshape2 worked with no problems from the start, the other solution with # tidyr/dplyr solution library(tidyr) library(dplyr) did not work correct in the start (it stopped R working and OS says R have to close) in windows 8.1 with R 3.1.2, quite possible because called libraries were compiled under R 3.1.3. When installed the new version of R: 3.2.1, the second solution worked also with no problem. For whoever has written the libraries tidyr and dplyr, they produced warnings for function name conflicts with base and stats: Attaching package: �dplyr� The following objects are masked from �package:stats�: filter, lag The following objects are masked from �package:base�: intersect, setdiff, setequal, union Great solutions! Thank you, Aldi x.HHu - data.frame( + HHid = c( 'HH1', 'HH2', 'HH3', 'HH4', 'HH5', 'HH10' ) + , indv1 = c( 2, 0, 2 , 0, 2, 0 ) + , indv2 = c( 0, NA, 2, 2, 2, 2 ) + , ind3 = c( 0, 0, 0, 0, 0, 0 ) + , stringsAsFactors = FALSE # avoid creating HHid as a factor + ) y.HHo - data.frame( + HHid=c( 'HH1', 'HH2','HH5', 'HH3', 'HH10' ) + , indv1 = c( 2, 0, 2, 0, NA ) + , indv2 = c( 0, 2, 2, 1, 2 ) + , stringsAsFactors = FALSE + ) z.map - data.frame( + HHid = c( 'HH1', 'HH2', 'HH3', 'HH4', 'HH5' + , 'HH6','HH8', 'HH7', 'HH9', 'HH10', 'HH11' ) + , position= c( 10, 20, 30, 42, 55, 66, 81, 75, 92, 101, 111 ) + , stringsAsFactors = FALSE + ) # reshape2 solution library(reshape2) x.HHu.long - melt( x.HHu, HHid, variable.name = indv ) x.HHu.long$indv - as.character( x.HHu.long$indv ) y.HHo.long - melt( y.HHo, HHid, variable.name = indv ) y.HHo.long$indv - as.character( y.HHo.long$indv ) xy.HH.long - merge( x.HHu.long +, y.HHo.long +, by = c( HHid, indv ) +, all = TRUE ) xy.HH.long$value - with( xy.HH.long + , ifelse( is.na( value.y ) + , value.x + , value.y ) ) xy.HH0 - dcast( xy.HH.long, HHid ~ indv ) xy.HH - merge( xy.HH0, z.map, all=TRUE ) xy.HH - xy.HH[ order( xy.HH$HHid ), ] # compare xy.HH with zzz ... I think there is an error in zzz for # HH10/indv1, because NA should not be considered more informative # than 0... ### solution with reshape2 xy.HH HHid ind3 indv1 indv2 position 1 HH10 2 0 10 2 HH100 0 2 101 3 HH11 NANANA 111 4 HH20 0 2 20 5 HH30 0 1 30 6 HH40 0 2 42 7 HH50 2 2 55 8 HH6 NANANA 66 9 HH7 NANANA 75 10 HH8 NANANA 81 11 HH9 NANANA 92 ### Solution with tidyr: xy.HH.d HHid ind3 indv1 indv2 position 1 HH10 2 0 10 2 HH100 0 2 101 3 HH11 NANANA 111 4 HH20 0 2 20 5 HH30 0 1 30 6 HH40 0 2 42 7 HH50 2 2 55 8 HH6 NANANA 66 9 HH7 NANANA 75 10 HH8 NANANA 81 11 HH9 NANANA 92 On 7/12/2015 1:35 PM, Jeff Newmiller wrote: I get confused by your use of the position map table. If I follow your description toward your desired result, I take a different route that makes sense to me. Perhaps it will make sense to you as well. The key idea is to make individual comparisons of the values for each combination of HHid and indv, regardless of where they are in the original data frames. Below are two different syntactic representations of my understanding of your problem. They differ only in the approach taken to strip away syntax clutter. I start by making the HHid identifiers character values in the original data frames, because their respective factor levels in the two data frames would not necessarily correspond. x.HHu - data.frame( HHid = c( 'HH1', 'HH2', 'HH3', 'HH4', 'HH5', 'HH10' ) , indv1 = c( 2, 0, 2 , 0, 2, 0 ) , indv2 = c( 0, NA, 2, 2, 2, 2 ) , ind3 = c( 0, 0, 0, 0, 0, 0 ) , stringsAsFactors = FALSE # avoid creating HHid as a factor ) y.HHo - data.frame( HHid=c( 'HH1', 'HH2','HH5', 'HH3',
Re: [R] merge: right set overwrite left set
I get confused by your use of the position map table. If I follow your description toward your desired result, I take a different route that makes sense to me. Perhaps it will make sense to you as well. The key idea is to make individual comparisons of the values for each combination of HHid and indv, regardless of where they are in the original data frames. Below are two different syntactic representations of my understanding of your problem. They differ only in the approach taken to strip away syntax clutter. I start by making the HHid identifiers character values in the original data frames, because their respective factor levels in the two data frames would not necessarily correspond. x.HHu - data.frame( HHid = c( 'HH1', 'HH2', 'HH3', 'HH4', 'HH5', 'HH10' ) , indv1 = c( 2, 0, 2 , 0, 2, 0 ) , indv2 = c( 0, NA, 2, 2, 2, 2 ) , ind3 = c( 0, 0, 0, 0, 0, 0 ) , stringsAsFactors = FALSE # avoid creating HHid as a factor ) y.HHo - data.frame( HHid=c( 'HH1', 'HH2','HH5', 'HH3', 'HH10' ) , indv1 = c( 2, 0, 2, 0, NA ) , indv2 = c( 0, 2, 2, 1, 2 ) , stringsAsFactors = FALSE ) z.map - data.frame( HHid = c( 'HH1', 'HH2', 'HH3', 'HH4', 'HH5' , 'HH6','HH8', 'HH7', 'HH9', 'HH10', 'HH11' ) , position= c( 10, 20, 30, 42, 55, 66, 81, 75, 92, 101, 111 ) , stringsAsFactors = FALSE ) # reshape2 solution library(reshape2) x.HHu.long - melt( x.HHu, HHid, variable.name = indv ) x.HHu.long$indv - as.character( x.HHu.long$indv ) y.HHo.long - melt( y.HHo, HHid, variable.name = indv ) y.HHo.long$indv - as.character( y.HHo.long$indv ) xy.HH.long - merge( x.HHu.long , y.HHo.long , by = c( HHid, indv ) , all = TRUE ) xy.HH.long$value - with( xy.HH.long , ifelse( is.na( value.y ) , value.x , value.y ) ) xy.HH0 - dcast( xy.HH.long, HHid ~ indv ) xy.HH - merge( xy.HH0, z.map, all=TRUE ) xy.HH - xy.HH[ order( xy.HH$HHid ), ] # compare xy.HH with zzz ... I think there is an error in zzz for # HH10/indv1, because NA should not be considered more informative # than 0... # tidyr/dplyr solution library(tidyr) library(dplyr) # define a common processing sequence lengthen - ( . # period is placeholder for data frame %% gather( indv, value, -HHid ) %% mutate( indv = as.character( indv ) ) ) x.HHu.dlong - x.HHu %% lengthen y.HHo.dlong - y.HHo %% lengthen xy.HH.d - ( x.HHu.dlong %% full_join( y.HHo.dlong, by= c( HHid, indv ) ) %% transmute( HHid = HHid , indv = indv , value = ifelse( is.na( value.y ) , value.x , value.y ) ) %% spread( indv, value ) %% full_join( z.map, by=HHid ) %% arrange( HHid ) %% as.data.frame ) On Sun, 12 Jul 2015, aldi wrote: Hi, I have two sets of data x.HHu and y.HHo, rows are IDs and columns are individuals. I do not know in advance indv or HHid, both of them will be captured from the data. As the y.HHo set updates, y.HHo set has better information then x.HHu set. Thus I want a merge where right set overwrites left set info based on HHid, i.e. to overwrite x.HHu set with y.HHo set but keep any extra info from the x.HHu set that is not present in y.HHo set. HHids will be complete based on z.map, with the corresponding positions. I am having trouble with the part after this line: ### +++ I am thinking that I am creating new columns position indv1 and indv2, but R is interpreting them as row information. See the expected final table at the end. HHid is common, indv3 is from x.HHu, and the rest position and indv1 and indv2 are from y.HHo Any suggestions are appreciated. Thank you in advance, Aldi x.HHu- data.frame( HHid = c( 'HH1', 'HH2', 'HH3', 'HH4', 'HH5', 'HH10') , indv1 = c( 2, 0, 2 , 0, 2, 0) , indv2 = c( 0, NA, 2, 2, 2, 2) , ind3 = c( 0, 0, 0, 0, 0, 0) ) ### the HHo data will be the top set to overwrite any HHu data, when they exist, thinking that HHo are better than HHu results ### when they are available y.HHo-data.frame(HHid=c('HH1', 'HH2','HH5', 'HH3', 'HH10') , indv1 = c(2, 0, 2, 0, NA) , indv2 = c(0, 2, 2, 1, 2) ) z.map-data.frame(HHid = c('HH1', 'HH2', 'HH3', 'HH4', 'HH5', 'HH6','HH8', 'HH7', 'HH9', 'HH10', 'HH11') , position= c(10,20,30,42,55,66,81,75,92,101,111) ) ### see objects x.HHu y.HHo z.map ### now sort the map by position, this sorted map will be used to sort finally all data z.map-z.map[with(z.map, order(position)), ] z.map ### First I introduce position to both sets so I can sort them in advance by position. x.HHu.map -merge( z.map, x.HHu, by='HHid', all=T)
[R] merge: right set overwrite left set
Hi, I have two sets of data x.HHu and y.HHo, rows are IDs and columns are individuals. I do not know in advance indv or HHid, both of them will be captured from the data. As the y.HHo set updates, y.HHo set has better information then x.HHu set. Thus I want a merge where right set overwrites left set info based on HHid, i.e. to overwrite x.HHu set with y.HHo set but keep any extra info from the x.HHu set that is not present in y.HHo set. HHids will be complete based on z.map, with the corresponding positions. I am having trouble with the part after this line: ### +++ I am thinking that I am creating new columns position indv1 and indv2, but R is interpreting them as row information. See the expected final table at the end. HHid is common, indv3 is from x.HHu, and the rest position and indv1 and indv2 are from y.HHo Any suggestions are appreciated. Thank you in advance, Aldi x.HHu- data.frame( HHid = c( 'HH1', 'HH2', 'HH3', 'HH4', 'HH5', 'HH10') , indv1 = c( 2, 0, 2 , 0, 2, 0) , indv2 = c( 0, NA, 2, 2, 2, 2) , ind3 = c( 0, 0, 0, 0, 0, 0) ) ### the HHo data will be the top set to overwrite any HHu data, when they exist, thinking that HHo are better than HHu results ### when they are available y.HHo-data.frame(HHid=c('HH1', 'HH2','HH5', 'HH3', 'HH10') , indv1 = c(2, 0, 2, 0, NA) , indv2 = c(0, 2, 2, 1, 2) ) z.map-data.frame(HHid = c('HH1', 'HH2', 'HH3', 'HH4', 'HH5', 'HH6','HH8', 'HH7', 'HH9', 'HH10', 'HH11') , position= c(10,20,30,42,55,66,81,75,92,101,111) ) ### see objects x.HHu y.HHo z.map ### now sort the map by position, this sorted map will be used to sort finally all data z.map-z.map[with(z.map, order(position)), ] z.map ### First I introduce position to both sets so I can sort them in advance by position. x.HHu.map -merge( z.map, x.HHu, by='HHid', all=T) x.HHu.map-x.HHu.map[with(x.HHu.map, order(position)), ] x.HHu.map y.HHo.map -merge( z.map, y.HHo, by='HHid', all= T) y.HHo.map-y.HHo.map[with(y.HHo.map, order(position)), ] y.HHo.map ### now merge HHu and HHo with the hope to overwrite the HHu set with HHo wherever they overlap by column names. zzz - merge(x.HHu.map, y.HHo.map, by='HHid', all=T) zzz ### find common variable names in two sets commonNames - names(x.impu.map)[which(colnames(x.impu.map) %in% colnames(y.geno.map))] ## remove HHid wich is common for x and y, but work with the rest of columns commonNames-commonNames[-c(1)] ### +++ for(i in 1:length(commonNames)){ print(commonNames[i]) zzz$commonNames[i] - NA print(paste(zzz,$,commonNames[i],.y,sep=)) zzz$commonNames[i] - zzz[,paste(commonNames[i],.y,sep=)] ### paste(zzz$commonNames[i],.x,sep='') - NULL; ### paste(zzz$commonNames[i],.y,sep='') - NULL; } zzz The final expected set has to be: HHid is common, indv3 is from x.HHu, and the rest position and indv1 and indv2 are from y.HHo HHid position ind3 indv1 indv2 1 HH1 10 0 2 0 2 HH10101 0NA 2 3 HH11111 NANA NA 4 HH2 20 0 0 2 5 HH3 30 0 0 1 6 HH4 42 0NA NA 7 HH5 55 0 2 2 8 HH6 66 NANA NA 9 HH7 75 NANA NA 10 HH8 81 NANA NA 11 HH9 92 NANA NA -- [[alternative HTML version deleted]] __ R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] merge: right set overwrite left set
I think this does what you want: ## find idiv coloumns in x.HHu.map that don't exist in y.HHo.map x.HHu.map - x.HHu.map[ c(HHid, position, names(x.HHu.map)[ !names(x.HHu.map) %in% names(y.HHo.map)] )] ## merge, adding extra column from x.HHu.map zzz - merge(y.HHo.map, x.HHu.map, by=c('HHid', 'position'), all=T) ## order by HHid zzz - zzz[order(zzz$HHid),] Best, Ista On Sun, Jul 12, 2015 at 10:45 AM, aldi a...@dsgmail.wustl.edu wrote: Hi, I have two sets of data x.HHu and y.HHo, rows are IDs and columns are individuals. I do not know in advance indv or HHid, both of them will be captured from the data. As the y.HHo set updates, y.HHo set has better information then x.HHu set. Thus I want a merge where right set overwrites left set info based on HHid, i.e. to overwrite x.HHu set with y.HHo set but keep any extra info from the x.HHu set that is not present in y.HHo set. HHids will be complete based on z.map, with the corresponding positions. I am having trouble with the part after this line: ### +++ I am thinking that I am creating new columns position indv1 and indv2, but R is interpreting them as row information. See the expected final table at the end. HHid is common, indv3 is from x.HHu, and the rest position and indv1 and indv2 are from y.HHo Any suggestions are appreciated. Thank you in advance, Aldi x.HHu- data.frame( HHid = c( 'HH1', 'HH2', 'HH3', 'HH4', 'HH5', 'HH10') , indv1 = c( 2, 0, 2 , 0, 2, 0) , indv2 = c( 0, NA, 2, 2, 2, 2) , ind3 = c( 0, 0, 0, 0, 0, 0) ) ### the HHo data will be the top set to overwrite any HHu data, when they exist, thinking that HHo are better than HHu results ### when they are available y.HHo-data.frame(HHid=c('HH1', 'HH2','HH5', 'HH3', 'HH10') , indv1 = c(2, 0, 2, 0, NA) , indv2 = c(0, 2, 2, 1, 2) ) z.map-data.frame(HHid = c('HH1', 'HH2', 'HH3', 'HH4', 'HH5', 'HH6','HH8', 'HH7', 'HH9', 'HH10', 'HH11') , position= c(10,20,30,42,55,66,81,75,92,101,111) ) ### see objects x.HHu y.HHo z.map ### now sort the map by position, this sorted map will be used to sort finally all data z.map-z.map[with(z.map, order(position)), ] z.map ### First I introduce position to both sets so I can sort them in advance by position. x.HHu.map -merge( z.map, x.HHu, by='HHid', all=T) x.HHu.map-x.HHu.map[with(x.HHu.map, order(position)), ] x.HHu.map y.HHo.map -merge( z.map, y.HHo, by='HHid', all= T) y.HHo.map-y.HHo.map[with(y.HHo.map, order(position)), ] y.HHo.map ### now merge HHu and HHo with the hope to overwrite the HHu set with HHo wherever they overlap by column names. zzz - merge(x.HHu.map, y.HHo.map, by='HHid', all=T) zzz ### find common variable names in two sets commonNames - names(x.impu.map)[which(colnames(x.impu.map) %in% colnames(y.geno.map))] ## remove HHid wich is common for x and y, but work with the rest of columns commonNames-commonNames[-c(1)] ### +++ for(i in 1:length(commonNames)){ print(commonNames[i]) zzz$commonNames[i] - NA print(paste(zzz,$,commonNames[i],.y,sep=)) zzz$commonNames[i] - zzz[,paste(commonNames[i],.y,sep=)] ### paste(zzz$commonNames[i],.x,sep='') - NULL; ### paste(zzz$commonNames[i],.y,sep='') - NULL; } zzz The final expected set has to be: HHid is common, indv3 is from x.HHu, and the rest position and indv1 and indv2 are from y.HHo HHid position ind3 indv1 indv2 1 HH1 10 0 2 0 2 HH10101 0NA 2 3 HH11111 NANA NA 4 HH2 20 0 0 2 5 HH3 30 0 0 1 6 HH4 42 0NA NA 7 HH5 55 0 2 2 8 HH6 66 NANA NA 9 HH7 75 NANA NA 10 HH8 81 NANA NA 11 HH9 92 NANA NA -- [[alternative HTML version deleted]] __ R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code. __ R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.