Re: [R] union of two sets are smaller than one set?

2021-01-31 Thread Avi Gross via R-help
Martin,

You did not say your two starting objects were already sets. You said they
were vectors of strings, and those strings may well include duplicates. For
example, if I read in lots of text with a blank line between paragraphs, I
would get many empty, identical elements. Just converting that into a set
would shrink it.
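
As a small illustration of the blank-line case, here is a sketch on made-up text (the string and the variable names are hypothetical, not Martin's data):

```r
# Hypothetical text: three paragraphs separated by blank lines.
txt <- "First paragraph.\n\nSecond paragraph.\n\nThird paragraph."

# Split into lines; the blank lines become empty strings.
txt_lines <- strsplit(txt, "\n", fixed = TRUE)[[1]]

length(txt_lines)          # 5 elements, two of them ""
length(unique(txt_lines))  # 4: the two empty strings collapse into one
```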

You have not said how you created or processed your initial two vectors. It
is also possible that some entries were effectively deleted, for example
replaced by NA or an empty string, which would leave the vector longer than
its useful contents.

Your strings look like they may be filenames. Are they unique, especially
if they are files in different folders/directories?

There are many ways to check, but using your method, try this:

length(base::union(s1, s1))

If that is smaller than length(s1), then s1 itself contains duplicates.
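
A few other ways to run the same check, sketched on a toy vector (s1_toy is a hypothetical stand-in, since the real s1 was not posted):

```r
# Toy stand-in for s1 (hypothetical data, not the poster's vector).
s1_toy <- c("a.dk", "b.dk", "a.dk", "c.dk", "b.dk", "a.dk")

length(s1_toy)                       # 6
length(base::union(s1_toy, s1_toy))  # 3, the number of distinct values
anyDuplicated(s1_toy)                # 3: index of the first repeated element
sum(duplicated(s1_toy))              # 3 elements repeat an earlier one

# Which values occur more than once, and how often
counts <- table(s1_toy)
counts[counts > 1]
```

On the real data, anyDuplicated(s1) returning anything other than 0 would already confirm that s1 is not a set.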

-----Original Message-----
From: R-help On Behalf Of Martin Møller Skarbiniks Pedersen
Sent: Sunday, January 31, 2021 3:57 PM
To: R mailing list
Subject: [R] union of two sets are smaller than one set?

This is really puzzling me and when I try to make a small example everything
works like expected.

The problem:

I got these two large vectors of strings.

> str(s1)
 chr [1:766608] "0.dk" ...
> str(s2)
 chr [1:59387] "043.dk" "0606.dk" "0618.dk" "0888.dk" "0iq.dk" "0it.dk" ...

And I need to create the union-set of s1 and s2.
I expect the size of the union-set to be between 766608 and 766608+59387.
However it is 681193, which is less than the number of elements in s1!

> length(base::union(s1, s2))
[1] 681193

Any hints?

Regards
Martin


__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.



Re: [R] union of two sets are smaller than one set?

2021-01-31 Thread Enrico Schumann
On Sun, 31 Jan 2021, Martin Møller Skarbiniks Pedersen writes:

> This is really puzzling me and when I try to make a small example
> everything works like expected.
>
> The problem:
>
> I got these two large vectors of strings.
>
>> str(s1)
>  chr [1:766608] "0.dk" ...
>> str(s2)
>  chr [1:59387] "043.dk" "0606.dk" "0618.dk" "0888.dk" "0iq.dk" "0it.dk" ...
>
> And I need to create the union-set of s1 and s2.
> I expect the size of the union-set to be between 766608 and 766608+59387.
> However it is 681193, which is less than the number of elements in s1!
>
>> length(base::union(s1, s2))
> [1] 681193
>
> Any hints?
>
> Regards
> Martin
>

Duplicates?

kind regards
Enrico

-- 
Enrico Schumann
Lucerne, Switzerland
http://enricoschumann.net



Re: [R] union of two sets are smaller than one set?

2021-01-31 Thread Duncan Murdoch

On 31/01/2021 3:57 p.m., Martin Møller Skarbiniks Pedersen wrote:

> This is really puzzling me and when I try to make a small example
> everything works like expected.
>
> The problem:
>
> I got these two large vectors of strings.
>
>> str(s1)
>  chr [1:766608] "0.dk" ...
>> str(s2)
>  chr [1:59387] "043.dk" "0606.dk" "0618.dk" "0888.dk" "0iq.dk" "0it.dk" ...
>
> And I need to create the union-set of s1 and s2.
> I expect the size of the union-set to be between 766608 and 766608+59387.
> However it is 681193, which is less than the number of elements in s1!
>
>> length(base::union(s1, s2))
> [1] 681193
>
> Any hints?


I imagine unique(s1) is shorter than s1.  The union function is the same as

unique(c(s1, s2))

for your data.  (The only difference is if s1 or s2 is named:  the names 
are dropped.)
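
A minimal sketch of that equivalence on toy data (x and y are hypothetical, unnamed character vectors, not the poster's s1 and s2):

```r
x <- c("0.dk", "1.dk", "0.dk", "2.dk", "1.dk")  # 5 elements, 3 distinct
y <- c("2.dk", "3.dk")

u <- union(x, y)
identical(u, unique(c(x, y)))  # TRUE for these unnamed character vectors
length(u)                      # 4, smaller than length(x) == 5

# The bounds that do hold are in terms of distinct values
lower <- max(length(unique(x)), length(unique(y)))
upper <- length(unique(x)) + length(unique(y))
lower <= length(u) && length(u) <= upper  # TRUE
```

So the expected range for length(union(s1, s2)) starts at length(unique(s1)), not at length(s1).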


Duncan Murdoch



[R] union of two sets are smaller than one set?

2021-01-31 Thread Martin Møller Skarbiniks Pedersen
This is really puzzling me and when I try to make a small example
everything works like expected.

The problem:

I got these two large vectors of strings.

> str(s1)
 chr [1:766608] "0.dk" ...
> str(s2)
 chr [1:59387] "043.dk" "0606.dk" "0618.dk" "0888.dk" "0iq.dk" "0it.dk" ...

And I need to create the union-set of s1 and s2.
I expect the size of the union-set to be between 766608 and 766608+59387.
However it is 681193, which is less than the number of elements in s1!

> length(base::union(s1, s2))
[1] 681193

Any hints?

Regards
Martin

