Re: [R] merging corpora and metadata

2011-11-18 Thread Milan Bouchet-Valat
On Thursday, 17 November 2011 at 21:34 -0500, R. Michael Weylandt wrote:
 Hi Josh,
 
 You're absolutely right. I suppose one could set up some sort of S3
 thing for Henri's problem:
 
 c <- function(..., recursive = FALSE) UseMethod("c")
 c.default <- base::c
 c.corpus <- function(..., recursive = FALSE) {ans = c.default(...);
 attributes(ans) <- c(do.call(attributes, ...))}
 
 But agreed, it seems deeply risky.
This method already exists in the tm package where the Corpus class
comes from. Henri-Paul, see ?c.Corpus.

Specifically,

tot.corpus <- c(corpus.1, corpus.2, recursive = TRUE)
meta(tot.corpus)

works.

It looks weird that recursive=TRUE isn't the default, but the
documentation seems to imply that the merging of meta-data might produce
weird results, so that's probably why it's disabled by default. You may
want to get in touch with Ingo Feinerer about that.


Regards

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] merging corpora and metadata

2011-11-17 Thread Joshua Wiley
Hi Henri-Paul,

This can be rather tricky.  It would really help if you could give us
a reproducible example.  In this case, because you are dealing with
non-standard data structures (or at least added attributes), we need
the data exactly as R sees it.  This means either A) code to create
some data that demonstrates your problem or B) the output of calling
dput(corpus.1) (see ?dput for what it does and what to do).
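
As a rough illustration of option B, the sketch below runs dput() on a small stand-in object. The object `toy` and its `meta` attribute are made up for the example and are not tm's internal layout; the point is only that the text produced by dput() recreates the object together with its attributes.

```r
# Toy stand-in for a corpus: a list carrying a metadata attribute.
# (Object and attribute names are illustrative, not tm's internals.)
toy <- structure(
  list("first document", "second document"),
  meta = data.frame(cid = c(1L, 1L),
                    fname = c("a.scrb", "b.scrb"),
                    stringsAsFactors = FALSE)
)

# dput() writes R code that recreates the object, attributes included.
txt <- paste(capture.output(dput(toy)), collapse = "\n")
cat(txt, "\n")

# Anyone can paste that text back into R to rebuild the same object.
restored <- eval(parse(text = txt))
all.equal(toy, restored)
```

Posting the printed text in a message lets readers rebuild the object without access to the original files.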

One possibility (though it does not concatenate per se):

combined <- list(corpus.1, corpus.2)

*if* (there are only attributes in corpus.1 OR corpus.2) OR (the
attribute names in corpus.1 and corpus.2 are unique), then you could
do:

combined <- c(corpus.1, corpus.2)
attributes(combined) <- c(attributes(corpus.1), attributes(corpus.2))

but note that it is *very* likely that at least the names attributes
overlap, so you would need to address that somehow.  If attributes
overlap, you need to somehow merge them, and what is an appropriate
way to do that, I have no idea without knowing more about the data and
what is expected by functions that work with it.
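
To make the overlap problem concrete, here is a minimal base-R sketch on two toy lists (the attribute names `info` and `source` are invented for the example). It shows that c() drops custom attributes, and then applies one possible merge rule, concatenating shared attributes. That rule is an assumption; special attributes such as names, class or dim would need their own handling.

```r
x <- structure(list("doc a"), info = "from x", source = "x.txt")
y <- structure(list("doc b"), info = "from y")

combined <- c(x, y)
attributes(combined)   # the custom attributes did not survive c()

# Attribute names shared by both inputs need an explicit merge rule.
ax <- attributes(x)
ay <- attributes(y)
shared <- intersect(names(ax), names(ay))

# One possible rule (an assumption): concatenate shared attributes and
# carry over the ones that appear on only one side.
merged <- c(
  ax[setdiff(names(ax), shared)],
  ay[setdiff(names(ay), shared)],
  Map(c, ax[shared], ay[shared])
)
for (nm in names(merged)) attr(combined, nm) <- merged[[nm]]

attributes(combined)
```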

Best regards,

Josh

On Thu, Nov 17, 2011 at 1:43 PM, Henri-Paul Indiogine
hindiog...@gmail.com wrote:
 Greetings!

 I lose all my metadata after concatenating corpora. This is an
 example of what happens:

 meta(corpus.1)
   MetaID cid fid selfirst selend                         fname
 1       0   1  11     2169   2518    WCPD-2001-01-29-Pg217.scrb
 2       0   1  14     9189   9702     WCPD-2003-01-13-Pg39.scrb
 3       0   1  14     2109   2577     WCPD-2003-01-13-Pg39.scrb

 
 

 17      0   1 114    17863  18256    WCPD-2007-04-30-Pg515.scrb


 meta(corpus.2)
   MetaID cid fid selfirst selend                         fname
 1       0   2   2    11016  11600           DCPD-200900595.scrb
 2       0   2   6    19510  20098           DCPD-201000636.scrb
 3       0   2   6    23935  24573           DCPD-201000636.scrb

 
 

 94      0   2 127    16225  17128   WCPD-2009-01-12-Pg22-3.scrb


 tot.corpus <- c(corpus.1, corpus.2)
 meta(tot.corpus)

    MetaID
 1        0
 2        0
 3        0

 
 

 111      0


 This is from the structure of corpus.1

 ..$ MetaData:List of 2
  .. ..$ create_date: POSIXlt[1:1], format: "2011-11-17 21:09:57"
  .. ..$ creator    : chr "henk"
  ..$ Children: NULL
  ..- attr(*, "class")= chr "MetaDataNode"
  - attr(*, "DMetaData")='data.frame':   17 obs. of  6 variables:
  ..$ MetaID  : num [1:17] 0 0 0 0 0 0 0 0 0 0 ...
  ..$ cid     : int [1:17] 1 1 1 1 1 1 1 1 1 1 ...
  ..$ fid     : int [1:17] 11 14 14 17 46 80 80 80 91 91 ...
  ..$ selfirst: num [1:17] 2169 9189 2109 8315 9439 ...
  ..$ selend  : num [1:17] 2518 9702 2577 8881 10102 ...
  ..$ fname   : chr [1:17] "WCPD-2001-01-29-Pg217.scrb"
 "WCPD-2003-01-13-Pg39.scrb" "WCPD-2003-01-13-Pg39.scrb"
 "WCPD-2004-05-17-Pg856.scrb" ...
  - attr(*, "class")= chr [1:3] "VCorpus" "Corpus" "list"


 Any idea on what I could do to keep the metadata in the merged corpus?

 Thanks,
 Henri-Paul


 --
 Henri-Paul Indiogine

 Curriculum & Instruction
 Texas A&M University
 TutorFind Learning Centre

 Email: hindiog...@gmail.com
 Skype: hindiogine
 Website: http://people.cehd.tamu.edu/~sindiogine

-- 
Joshua Wiley
Ph.D. Student, Health Psychology
Programmer Analyst II, ATS Statistical Consulting Group
University of California, Los Angeles
https://joshuawiley.com/


Re: [R] merging corpora and metadata

2011-11-17 Thread R. Michael Weylandt michael.weyla...@gmail.com
What package is all this from?

You might check if there is a special rbind/cbind method provided. I don't
think you can easily change the behavior of c().

Michael



Re: [R] merging corpora and metadata

2011-11-17 Thread Joshua Wiley
Hi Michael,

require(sos)
findFn("{meta}", sortby = "Function")
## see that only two functions have the exact name, 'meta'
## one is titled, Meta Data Management in the package 'tm'
## seems a pretty likely choice

Also, the fact that it is a truly terrible idea does not mean it is not easy:

mvir <- new.env()
mvir$c <- function(x, ...) {cat("sure you can!\n"); mean(x, ...)}
attach(mvir)

c(x = 1:10)
detach(mvir)

rm(mvir)

Cheers,

Josh



-- 
Joshua Wiley
Ph.D. Student, Health Psychology
Programmer Analyst II, ATS Statistical Consulting Group
University of California, Los Angeles
https://joshuawiley.com/


Re: [R] merging corpora and metadata

2011-11-17 Thread R. Michael Weylandt
Hi Josh,

You're absolutely right. I suppose one could set up some sort of S3
thing for Henri's problem:

c <- function(..., recursive = FALSE) UseMethod("c")
c.default <- base::c
c.corpus <- function(..., recursive = FALSE) {ans = c.default(...);
attributes(ans) <- c(do.call(attributes, ...))}

But agreed, it seems deeply risky.
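
For readers who want to try the S3 idea, the following is a minimal runnable sketch on a toy class. The class name `toycorpus`, the constructor `new_toycorpus()` and the `meta` attribute are hypothetical and are not part of tm. Because c() is already an internal generic in base R, the sketch only defines a method and does not redefine c() itself. It assumes every piece carries a metadata data frame with the same columns.

```r
# Toy class standing in for a corpus; all names here are hypothetical.
new_toycorpus <- function(docs, meta) {
  stopifnot(length(docs) == nrow(meta))
  structure(docs, meta = meta, class = c("toycorpus", "list"))
}

# c() dispatches on the class of its first argument, so a method suffices.
c.toycorpus <- function(..., recursive = FALSE) {
  parts <- list(...)
  # Concatenate the documents as plain lists (drops class and attributes).
  docs <- do.call(c, lapply(parts, unclass))
  # Stack the per-document metadata rows in the same order.
  meta <- do.call(rbind, lapply(parts, attr, which = "meta"))
  new_toycorpus(docs, meta)
}

a <- new_toycorpus(list("doc 1", "doc 2"),
                   data.frame(cid = c(1L, 1L), fname = c("a.scrb", "b.scrb")))
b <- new_toycorpus(list("doc 3"),
                   data.frame(cid = 2L, fname = "c.scrb"))

both <- c(a, b)
attr(both, "meta")
```

Stacking rows with rbind() is one design choice among several; metadata with differing columns would need a more careful merge.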

Cheers,

Michael



Re: [R] merging corpora and metadata

2011-11-17 Thread Henri-Paul Indiogine
Hi Joshua!

2011/11/17 Joshua Wiley jwiley.ps...@gmail.com:
 One possibility (though it does not concatenate per se):

 combined <- list(corpus.1, corpus.2)

Thanks, I will look into it.


 *if* (there are only attributes in corpus.1 OR corpus.2) OR (the
 attribute names in corpus.1 and corpus.2 are unique), then you could
 do:

Unfortunately this is not the case.  In the meantime I rewrote the
code that generates the corpus so that the documents are combined into
a single corpus _before_ the metadata are added.  That solved the
problem.
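
The order of operations described above can be sketched in base R as follows. The object names and the use of a plain `meta` attribute are assumptions for illustration and do not reproduce the actual corpus-building code, which is not shown in this thread.

```r
# Documents and their metadata are kept separate until the end.
docs.1 <- list("doc 1", "doc 2")
docs.2 <- list("doc 3")
meta.1 <- data.frame(cid = 1L, fid = c(11L, 14L))
meta.2 <- data.frame(cid = 2L, fid = 2L)

# Step 1: combine the documents while nothing is attached to them.
all.docs <- c(docs.1, docs.2)

# Step 2: combine the metadata rows in the same order.
all.meta <- rbind(meta.1, meta.2)
stopifnot(length(all.docs) == nrow(all.meta))

# Step 3: attach the metadata once, to the combined object.
attr(all.docs, "meta") <- all.meta
```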

Thanks for your feedback and suggestions.

Henri-Paul



-- 
Henri-Paul Indiogine

Curriculum & Instruction
Texas A&M University
TutorFind Learning Centre

Email: hindiog...@gmail.com
Skype: hindiogine
Website: http://people.cehd.tamu.edu/~sindiogine
