[R] help with LDA topic modelling..

2021-12-19 Thread akshay kulkarni
Dear members,
I am using LDA for topic modelling of news articles
(topicmodels package). I am visualizing the accuracy with the LDAvis package.

The visualization shows clusters as circles, possibly intersecting. My question
is: if I find the optimal number of topics, k, and the circles representing
the topics don't intersect, have I achieved perfect segregation? Am I
right?

Thanking You,
Yours sincerely,
AKSHAY M KULKARNI

[[alternative HTML version deleted]]

__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Sum every n (4) observations by group

2021-12-19 Thread Avi Gross via R-help
Milu,

Your data seems to be very consistent in that each value of ID has eight
rows. You seem to want to just sum every four so that fits:

   ID Date   Value
1   A 4140 0.000207232
2   A 4141 0.000240141
3   A 4142 0.000271414
4   A 4143 0.000258384
5   A 4144 0.000243640
6   A 4145 0.000271480
7   A 4146 0.000280585
8   A 4147 0.000289691
9   B 4140 0.000298797
10  B 4141 0.000307903
11  B 4142 0.000317008
12  B 4143 0.000326114
13  B 4144 0.000335220
14  B 4145 0.000344326
15  B 4146 0.000353431
16  B 4147 0.000362537
17  C 4140 0.000371643
18  C 4141 0.000380749
19  C 4142 0.000389854
20  C 4143 0.000398960
21  C 4144 0.000408066
22  C 4145 0.000417172
23  C 4146 0.000426277
24  C 4147 0.000435383

There are many ways to do what you want, some more general than others, but
one trivial way is to add a column that contains 24 numbers ranging from 1
to 6, four of each, assuming mydf holds the above.

Here is an example of such a vector:

rep(1:(nrow(mydf)/4), each=4)
 [1] 1 1 1 1 2 2 2 2 3 3 3 3 4 4 4 4 5 5 5 5 6 6 6 6

So you can add a column like:

> mydf$fours <- rep(1:(nrow(mydf)/4), each=4)
> mydf
   ID Date   Value fours
1   A 4140 0.000207232 1
2   A 4141 0.000240141 1
3   A 4142 0.000271414 1
4   A 4143 0.000258384 1
5   A 4144 0.000243640 2
6   A 4145 0.000271480 2
7   A 4146 0.000280585 2
8   A 4147 0.000289691 2
9   B 4140 0.000298797 3
10  B 4141 0.000307903 3
11  B 4142 0.000317008 3
12  B 4143 0.000326114 3
13  B 4144 0.000335220 4
14  B 4145 0.000344326 4
15  B 4146 0.000353431 4
16  B 4147 0.000362537 4
17  C 4140 0.000371643 5
18  C 4141 0.000380749 5
19  C 4142 0.000389854 5
20  C 4143 0.000398960 5
21  C 4144 0.000408066 6
22  C 4145 0.000417172 6
23  C 4146 0.000426277 6
24  C 4147 0.000435383 6
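
The index above runs across the whole data frame, which lines up with the IDs here because every ID has a multiple of four rows. As a hedged sketch of a variant that restarts the block count inside each ID, base R's ave() can be used. The toy data frame and the column name block are made up for illustration and are not from the thread:

```r
# Toy data (hypothetical): ID "A" has 6 rows, ID "B" has 4 rows.
toy <- data.frame(ID = rep(c("A", "B"), times = c(6, 4)),
                  Value = 1:10)

# Block number within each ID: rows 1-4 of an ID -> 1, rows 5-8 -> 2, ...
toy$block <- ave(seq_along(toy$ID), toy$ID,
                 FUN = function(i) ceiling(seq_along(i) / 4))

toy$block
```

Grouping by ID and block then never lets a block straddle two IDs, and a short final block (here rows 5-6 of "A") simply forms its own group.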

You can now group the rows any way you like and apply a function to each
group; in this case you want a sum. I like to use the tidyverse functions,
so I will show that:

library(dplyr)  # provides %>%, group_by(), summarize(), n()

mydf %>%
  group_by(ID, fours) %>%
  summarize(sums = sum(Value), n = n())

I threw in the extra column n (the number of rows in each group) in case your
data sometimes does not have 4 at the end of a group or beginning of the next.
Here is the output:

# A tibble: 6 x 4
# Groups:   ID [3]
  ID    fours     sums     n
  <chr> <int>    <dbl> <int>
1 A         1 0.000977     4
2 A         2 0.00109      4
3 B         3 0.00125      4
4 B         4 0.00140      4
5 C         5 0.00154      4
6 C         6 0.00169      4

Of course there are all kinds of ways to do this in standard R, including
trivial ones like looping over indices starting at 1 and taking four at a
time and getting the Value data for mydf$Value[N] + mydf$Value[N+1] ...
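
A minimal sketch of the loop idea just described, next to a vectorised alternative that fills a four-row matrix and sums its columns. The vector vals holds made-up numbers rather than the thread's data, and both versions assume the length is a multiple of four:

```r
# Hypothetical values, chosen so the block sums are easy to check by hand.
vals <- c(1, 2, 3, 4, 10, 20, 30, 40)

# Loop version: start indices 1, 5, ... and add four values from each start.
starts <- seq(1, length(vals), by = 4)
loop_sums <- numeric(length(starts))
for (k in seq_along(starts)) {
  N <- starts[k]
  loop_sums[k] <- vals[N] + vals[N + 1] + vals[N + 2] + vals[N + 3]
}

# Vectorised version: fill a 4-row matrix column by column, then sum columns.
mat_sums <- colSums(matrix(vals, nrow = 4))
```

With the data frame from this thread the same calls would be applied to mydf$Value, which works because the rows are already ordered by ID and each ID has eight rows.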





Re: [R] Sum every n (4) observations by group

2021-12-19 Thread Miluji Sb
Dear Peter,

Thanks so much for your reply and the code! This is helpful.

What I would like is the data.frame below: sum the values for 4140, 4141,
4142, 4143 and then for 4144, 4145, 4146, 4147, for each of IDs A, B, and C.
Does that make sense? Thanks again!

Best.

Milu

results <- structure(list(ID = c("A", "A", "A", "A", "A", "A", "A", "A",
"B", "B", "B", "B", "B", "B", "B", "B", "C", "C", "C", "C", "C",
"C", "C", "C"), Date = c(4140L, 4141L, 4142L, 4143L, 4144L, 4145L,
4146L, 4147L, 4140L, 4141L, 4142L, 4143L, 4144L, 4145L, 4146L,
4147L, 4140L, 4141L, 4142L, 4143L, 4144L, 4145L, 4146L, 4147L
), Value = c(0.000207232, 0.000240141, 0.000271414, 0.000258384,
0.00024364, 0.00027148, 0.000280585, 0.000289691, 0.000298797,
0.000307903, 0.000317008, 0.000326114, 0.00033522, 0.000344326,
0.000353431, 0.000362537, 0.000371643, 0.000380749, 0.000389854,
0.00039896, 0.000408066, 0.000417172, 0.000426277, 0.000435383
), sum = c(NA, NA, NA, 0.000977171, NA, NA, NA, 0.001054089,
NA, NA, NA, 0.001213399, NA, NA, NA, 0.001395514, NA, NA, NA,
0.001541206, NA, NA, NA, 0.001686898)), class = "data.frame", row.names =
c(NA,
-24L))
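
One possible way to build that layout in base R, offered as a sketch rather than as the method anyone in the thread proposed: compute each block total with ave() and keep it only on the last row of its block. The data frame below is a small stand-in with made-up values, and pos, block and total are illustrative names:

```r
# Toy stand-in for the data (hypothetical values): 2 IDs x 8 rows.
df <- data.frame(ID = rep(c("A", "B"), each = 8),
                 Date = rep(4140:4147, times = 2),
                 Value = 1:16)

# Position of each row within its ID (1..8) and its block of four.
pos   <- ave(seq_along(df$ID), df$ID, FUN = seq_along)
block <- ceiling(pos / 4)

# Block total, repeated on every row of the block ...
total <- ave(df$Value, df$ID, block, FUN = sum)

# ... then kept only on the fourth row of each block, NA elsewhere.
df$sum <- ifelse(pos %% 4 == 0, total, NA)
```

A trailing block with fewer than four rows would get NA throughout under this rule, which may or may not be what is wanted.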



Re: [R] Sum every n (4) observations by group

2021-12-19 Thread Peter Langfelder
I'm not sure I understand the task, but if I do, assuming your data
frame is assigned to a variable named df, I would do something like

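# Sum consecutive blocks of n values in x; length(x) must be a multiple of n.
sumNs = function(x, n)
{
   if (length(x) %% n != 0) stop("Length of 'x' must be a multiple of 'n'.")
   n1 = length(x)/n            # number of blocks
   ind = rep(1:n1, each = n)   # block index, e.g. 1,1,1,1,2,2,2,2 for n = 4
   tapply(x, ind, sum)
}
# Apply per ID; the result holds one vector of block sums for each ID.
sums = tapply(df$Value, df$ID, sumNs, 4)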

Peter



[R] Sum every n (4) observations by group

2021-12-19 Thread Miluji Sb
Dear all,

I have a dataset (below) by ID and time sequence. I would like to sum every
four observations by ID.

I am confused about how to combine the two conditions. Any help will be highly
appreciated. Thank you!

Best.

Milu

## Dataset
structure(list(ID = c("A", "A", "A", "A", "A", "A", "A", "A",
"B", "B", "B", "B", "B", "B", "B", "B", "C", "C", "C", "C", "C",
"C", "C", "C"), Date = c(4140L, 4141L, 4142L, 4143L, 4144L, 4145L,
4146L, 4147L, 4140L, 4141L, 4142L, 4143L, 4144L, 4145L, 4146L,
4147L, 4140L, 4141L, 4142L, 4143L, 4144L, 4145L, 4146L, 4147L
), Value = c(0.000207232, 0.000240141, 0.000271414, 0.000258384,
0.00024364, 0.00027148, 0.000280585, 0.000289691, 0.000298797,
0.000307903, 0.000317008, 0.000326114, 0.00033522, 0.000344326,
0.000353431, 0.000362537, 0.000371643, 0.000380749, 0.000389854,
0.00039896, 0.000408066, 0.000417172, 0.000426277, 0.000435383
)), class = "data.frame", row.names = c(NA, -24L))



Re: [R] Bug in list.files(full.names=T)

2021-12-19 Thread Duncan Murdoch
I don't know the answer to your question, but I see the same behaviour 
on MacOS, e.g. list.files("./") includes ".//R" in the results on my 
system.  Both "./R" and ".//R" are legal ways to express that path on 
MacOS, so it's not a serious bug, but it does look ugly.


Duncan Murdoch



[R] Speed up studentized confidence intervals ?

2021-12-19 Thread varin sacha via R-help
Dear R-experts,

My R code below works, but really, really slowly! It takes 2 hours on my
computer to finally get an answer. Is there a way to improve my R code to
speed it up, at least enough to save 1 hour? ;=)

Many thanks


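library(boot)

s <- sample(178:798, 10, replace = TRUE)
mean(s)

N <- 1000
out <- replicate(N, {
  a <- sample(s, size = 5)
  mean(a)
  dat <- data.frame(a)

  # Statistic for boot(): the resample mean and a bootstrap estimate of its
  # variance, which the studentized interval requires.
  med <- function(d, i) {
    temp <- d[i, ]
    f <- mean(temp)
    g <- var(replicate(50, mean(sample(temp, replace = TRUE))))
    return(c(f, g))
  }

  boot.out <- boot(data = dat, statistic = med, R = 1)
  boot.ci(boot.out, type = "stud")$stud[, 4:5]
})
# Share of the N intervals that contain mean(s)
mean(out[1, ] < mean(s) & mean(s) < out[2, ])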
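
One possible speed-up, sketched under the assumption that most of the time goes into the inner replicate() call inside med: draw all 50 inner resamples with a single sample() call, arrange them as matrix columns, and take colMeans(). The name med_fast and the argument n_inner are made up for this sketch. The draws follow the same distribution as before, but the random numbers are consumed in a different order, so results will not match the original run for run.

```r
# Same statistic as med, but the inner resamples are drawn in one call and
# averaged column-wise instead of one at a time via replicate().
med_fast <- function(d, i, n_inner = 50) {
  temp <- d[i, ]                       # resampled data as a plain vector
  n <- length(temp)
  inner <- matrix(sample(temp, n * n_inner, replace = TRUE), nrow = n)
  c(mean(temp), var(colMeans(inner)))  # estimate and its variance estimate
}

# Small check on made-up data
set.seed(1)
dat <- data.frame(a = sample(178:798, 5))
res <- med_fast(dat, 1:5)
```

med_fast can be passed as the statistic argument of boot() in place of med. Independently of this, boot() has parallel and ncpus arguments that may help on a multi-core machine.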




[R] Bug in list.files(full.names=T)

2021-12-19 Thread Mario Reutter
Dear everybody,

I'm a researcher in the field of psychology and a passionate R user. After
having updated to the newest version, I experienced a problem with
list.files() if the parameter full.names is set to TRUE.
A path separator "/" is now always appended to path in the output even if
path %>% endsWith("/"). This breaks backwards compatibility in case path
ends with a path separator. The problem occurred somewhere between R
version 3.6.1 (2019-07-05) and 4.1.2 (2021-11-01).

Example:
> list.files("C:/Data/", full.names=T)
C:/Data//file.csv

Expected behavior:
Either a path separator should never be appended, in accordance with
the documentation ("full.names: a logical value. If TRUE, the directory
path is prepended to the file names to give a relative file path."),
or it should only be appended if path doesn't already end with a path
separator.

My question would now be if this warrants a bug report? And if you agree,
could someone issue the report since I'm not a member on Bugzilla?

Thank you and best regards,
Mario Reutter
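
Until the behaviour is settled, a caller-side workaround could look like the sketch below: strip trailing separators from path before calling list.files(), so that exactly one separator ends up in the result. The wrapper name list_files_full is hypothetical, and the sketch assumes path is not a bare root such as "/" or "C:/", where stripping the separator would change the meaning.

```r
list_files_full <- function(path, ...) {
  # Drop trailing "/" or "\" so list.files() adds exactly one separator.
  # Assumption: path is not a bare root such as "/" or "C:/".
  path <- sub("[/\\\\]+$", "", path)
  list.files(path, full.names = TRUE, ...)
}

# Demo in a temporary directory (names are made up for the example)
demo_dir <- file.path(tempdir(), "Data")
dir.create(demo_dir, showWarnings = FALSE)
file.create(file.path(demo_dir, "file.csv"))

files <- list_files_full(paste0(demo_dir, "/"))
```

Here files ends in "Data/file.csv" whether or not the input path carried a trailing slash.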
