Re: [R] merging corpora and metadata
On Thursday, 17 November 2011 at 21:34 -0500, R. Michael Weylandt wrote:
> Hi Josh,
>
> You're absolutely right. I suppose one could set up some sort of S3
> thing for Henri's problem:
>
>     c <- function(..., recursive = FALSE) UseMethod("c")
>     c.default <- base::c
>     c.corpus <- function(..., recursive = FALSE) {
>       ans <- c.default(...)
>       # combine the attribute lists of all arguments (duplicate names are not resolved)
>       attributes(ans) <- do.call(c.default, lapply(list(...), attributes))
>       ans
>     }
>
> But agreed, it seems deeply risky.

This method already exists in the tm package, where the Corpus class comes from. Henri-Paul, see ?c.Corpus. Specifically,

    tot.corpus <- c(corpus.1, corpus.2, recursive = TRUE)
    meta(tot.corpus)

works. It looks odd that recursive = TRUE isn't the default, but the documentation seems to imply that merging the metadata might produce weird results, so that's probably why it's disabled by default. You may want to get in touch with Ingo Feinerer about that.

Regards

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Re: [R] merging corpora and metadata
Hi Henri-Paul,

This can be rather tricky. It would really help if you could give us a reproducible example. In this case, because you are dealing with non-standard data structures (or at least added attributes), we need the data exactly as R sees it. This means either

A) code to create some data that demonstrates your problem, or
B) the output of calling dput(corpus.1) (see ?dput for what it does and what to do).

One possibility (though it does not concatenate per se):

    combined <- list(corpus.1, corpus.2)

*If* (there are only attributes in corpus.1 OR corpus.2) OR (the attribute names in corpus.1 and corpus.2 are unique), then you could do:

    combined <- c(corpus.1, corpus.2)
    attributes(combined) <- c(attributes(corpus.1), attributes(corpus.2))

but note that it is *very* likely that at least the names attributes overlap, so you would need to address that somehow. If attributes overlap, you need to merge them somehow, and I have no idea what an appropriate way to do that is without knowing more about the data and what the functions that work with it expect.

Best regards,

Josh

On Thu, Nov 17, 2011 at 1:43 PM, Henri-Paul Indiogine <hindiog...@gmail.com> wrote:
> Greetings!
>
> I lose all my metadata after concatenating corpora. This is an
> example of what happens:
>
> meta(corpus.1)
>    MetaID cid fid selfirst selend                      fname
> 1       0   1  11     2169   2518 WCPD-2001-01-29-Pg217.scrb
> 2       0   1  14     9189   9702  WCPD-2003-01-13-Pg39.scrb
> 3       0   1  14     2109   2577  WCPD-2003-01-13-Pg39.scrb
> 17      0   1 114    17863  18256 WCPD-2007-04-30-Pg515.scrb
>
> meta(corpus.2)
>    MetaID cid fid selfirst selend                       fname
> 1       0   2   2    11016  11600         DCPD-200900595.scrb
> 2       0   2   6    19510  20098         DCPD-201000636.scrb
> 3       0   2   6    23935  24573         DCPD-201000636.scrb
> 94      0   2 127    16225  17128 WCPD-2009-01-12-Pg22-3.scrb
>
> tot.corpus <- c(corpus.1, corpus.2)
> meta(tot.corpus)
>     MetaID
> 1        0
> 2        0
> 3        0
> 111      0
>
> This is from the structure of corpus.1:
>
>   ..$ MetaData:List of 2
>   .. ..$ create_date: POSIXlt[1:1], format: "2011-11-17 21:09:57"
>   .. ..$ creator    : chr "henk"
>   ..$ Children: NULL
>   ..- attr(*, "class")= chr "MetaDataNode"
>  - attr(*, "DMetaData")='data.frame': 17 obs. of 6 variables:
>   ..$ MetaID  : num [1:17] 0 0 0 0 0 0 0 0 0 0 ...
>   ..$ cid     : int [1:17] 1 1 1 1 1 1 1 1 1 1 ...
>   ..$ fid     : int [1:17] 11 14 14 17 46 80 80 80 91 91 ...
>   ..$ selfirst: num [1:17] 2169 9189 2109 8315 9439 ...
>   ..$ selend  : num [1:17] 2518 9702 2577 8881 10102 ...
>   ..$ fname   : chr [1:17] "WCPD-2001-01-29-Pg217.scrb" "WCPD-2003-01-13-Pg39.scrb" "WCPD-2003-01-13-Pg39.scrb" "WCPD-2004-05-17-Pg856.scrb" ...
>  - attr(*, "class")= chr [1:3] "VCorpus" "Corpus" "list"
>
> Any idea on what I could do to keep the metadata in the merged corpus?
>
> Thanks,
> Henri-Paul
>
> --
> Henri-Paul Indiogine
> Curriculum & Instruction
> Texas A&M University
> TutorFind Learning Centre
> Email: hindiog...@gmail.com
> Skype: hindiogine
> Website: http://people.cehd.tamu.edu/~sindiogine

--
Joshua Wiley
Ph.D. Student, Health Psychology
Programmer Analyst II, ATS Statistical Consulting Group
University of California, Los Angeles
https://joshuawiley.com/
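The attribute caveat in the reply above can be reproduced without tm. What follows is only a minimal base-R sketch: the two objects are toy stand-ins for corpora, and the "dmeta" attribute name is an assumption made for illustration, not tm's internal layout.

```r
# Toy stand-ins for two corpora: plain lists carrying a per-document
# metadata data frame in a hypothetical "dmeta" attribute.
corpus.a <- structure(list("doc one", "doc two"),
                      dmeta = data.frame(fid = c(11L, 14L)))
corpus.b <- structure(list("doc three"),
                      dmeta = data.frame(fid = 2L))

# The default c() keeps the elements but drops the "dmeta" attribute.
combined <- c(corpus.a, corpus.b)
length(combined)       # 3
attributes(combined)   # NULL

# Naively concatenating the attribute lists yields two entries that are
# both called "dmeta", so the names clash instead of merging.
all.attrs <- c(attributes(corpus.a), attributes(corpus.b))
names(all.attrs)       # "dmeta" "dmeta"
```

The clash in the last step is why the attributes have to be merged deliberately (for example by row-binding the data frames) rather than simply concatenated.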
Re: [R] merging corpora and metadata
What package is all this from? You might check whether a special rbind/cbind method is provided. I don't think you can easily change the behavior of c().

Michael
Re: [R] merging corpora and metadata
Hi Michael,

    require(sos)
    findFn("{meta}", sortby = "Function")
    ## see that only two functions have the exact name, 'meta'
    ## one is titled "Meta Data Management", in the package 'tm'
    ## seems a pretty likely choice

Also, the fact that it is a truly terrible idea does not mean it is not easy:

    mvir <- new.env()
    mvir$c <- function(x, ...) {cat("sure you can!\n"); mean(x, ...)}
    attach(mvir)
    c(x = 1:10)
    detach(mvir)
    rm(mvir)

Cheers,

Josh

On Thu, Nov 17, 2011 at 5:25 PM, R. Michael Weylandt <michael.weyla...@gmail.com> wrote:
> What package is all this from? You might check whether a special
> rbind/cbind method is provided. I don't think you can easily change
> the behavior of c().
>
> Michael
Re: [R] merging corpora and metadata
Hi Josh,

You're absolutely right. I suppose one could set up some sort of S3 thing for Henri's problem:

    c <- function(..., recursive = FALSE) UseMethod("c")
    c.default <- base::c
    c.corpus <- function(..., recursive = FALSE) {
      ans <- c.default(...)
      # combine the attribute lists of all arguments (duplicate names are not resolved)
      attributes(ans) <- do.call(c.default, lapply(list(...), attributes))
      ans
    }

But agreed, it seems deeply risky.

Cheers,

Michael
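A less invasive variant of the S3 idea is sketched below, under stated assumptions: the "toy_corpus" class, its constructor and the "dmeta" attribute are all hypothetical and are not tm's implementation. Because c() is already an internal generic in base R, a class-specific method is enough and the generic does not need to be replaced.

```r
# Hypothetical constructor: a list of documents plus a metadata data frame
# stored in a "dmeta" attribute, tagged with the class "toy_corpus".
toy_corpus <- function(docs, dmeta) {
  stopifnot(length(docs) == nrow(dmeta))
  structure(as.list(docs), dmeta = dmeta, class = "toy_corpus")
}

# Method for the internal generic c(): concatenate the documents and
# row-bind the per-document metadata so that both stay aligned.
c.toy_corpus <- function(..., recursive = FALSE) {
  parts <- list(...)
  docs  <- do.call(c, lapply(parts, unclass))
  dmeta <- do.call(rbind, lapply(parts, attr, which = "dmeta"))
  toy_corpus(docs, dmeta)
}

a <- toy_corpus(c("doc one", "doc two"),
                data.frame(cid = 1L, fid = c(11L, 14L)))
b <- toy_corpus("doc three", data.frame(cid = 2L, fid = 2L))

merged <- c(a, b)
attr(merged, "dmeta")   # three rows, one per document
```

This sketch assumes every argument is a "toy_corpus" with the same metadata columns; rbind() would fail on mismatched columns, which is the kind of merging problem raised earlier in the thread.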
Re: [R] merging corpora and metadata
Hi Joshua!

2011/11/17 Joshua Wiley <jwiley.ps...@gmail.com>:
> One possibility (though it does not concatenate per se):
>
>     combined <- list(corpus.1, corpus.2)

Thanks, I will look into it.

> *if* (there are only attributes in corpus.1 OR corpus.2) OR (the
> attribute names in corpus.1 and corpus.2 are unique), then you could do:

Unfortunately this is not the case. In the meantime I rewrote the code that generates the corpus so that the documents are combined into a single corpus _before_ the metadata are added. That solved the problem.

Thanks for your feedback and suggestions.

Henri-Paul

--
Henri-Paul Indiogine
Curriculum & Instruction
Texas A&M University
TutorFind Learning Centre
Email: hindiog...@gmail.com
Skype: hindiogine
Website: http://people.cehd.tamu.edu/~sindiogine
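The workaround described in this last message (combine first, add metadata afterwards) might look roughly like the base-R sketch below. The object names and the "dmeta" attribute are assumptions for illustration only; the poster's actual corpus-building code is not shown in the thread.

```r
# Hypothetical raw inputs: document texts and their per-document metadata,
# kept side by side until the very end.
docs.1 <- c("text of doc one", "text of doc two")
meta.1 <- data.frame(cid = 1L, fid = c(11L, 14L))
docs.2 <- "text of doc three"
meta.2 <- data.frame(cid = 2L, fid = 2L)

# Step 1: combine the documents and the metadata separately.
all.docs <- c(docs.1, docs.2)
all.meta <- rbind(meta.1, meta.2)
stopifnot(length(all.docs) == nrow(all.meta))

# Step 2: attach the metadata once, to the already combined object.
combined <- structure(as.list(all.docs), dmeta = all.meta)
attr(combined, "dmeta")
```

Since the metadata is attached only once, no attribute merging is needed at all, which sidesteps the overlap problem discussed above.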