subject:"\[R\] dealing with a messy dataset"

Re: [R] dealing with a messy dataset

2017-10-05 Thread Boris Steipe

Just for the record - and posterity: this is the Wrong way to go about defining 
a fixed width format and the strategy has a significant probability of 
corrupting data in ways that are hard to spot and hard to debug. If you _have_ 
the specification, then _use_ the specification.

Consider what you are actually doing here:

start <- c(1, 20, 35, 41, 44, 48, 53, 59, 64, 70, 76, 78, 83, 88,
   93, 114, 122, 127)
diff(start)  # produces a vector of assumed column widths
# [1] 19 15  6  3  4  5  6  5  6  6  2  5  5  5 21  8  5

First:
Jim's solution uses fwf_widths(), which is not correct in the first place since 
not all positions of the record are actually specified: position 19, 35, 41, 
44, 48, 53, 59 etc. etc. are columns that are intentionally left blank to 
visually separate numbers in the file. fwf_widths() assumes there are no gaps 
in the data. Can you assume these blank columns are always " "? Maybe in your 
case you can get away with this, but in the general case that's a risky 
assumption to make. As you said, you only care about two columns anyway, and 
those happen to be correct - but that's all the more reason not to even touch 
the others.

Second:
Jim's solution is a visual approximation, based on a small number of sample 
rows you have provided. As the spec shows, this gets it wrong because your 
sample does not exhaust all possibilities and he has no way of knowing the 
semantics of the data. The spec has 23 columns of data, you are reading only 
17. For example the second column in that approach reads 15 characters, and now 
contains RAh, RAm, RAs, DE-, DEd, DEm and DEs all rolled into one string.

Third:
When writing code like this, you end up with a whole line of "magic numbers". 
This is unmaintainable. I would require my students to write something like the 
following, which comments what the columns _mean_:

cNames <- character()
cStarts <- integer()
cEnds <- integer()

cNames[1]  <- "Name" # Galaxy name in well-known catalogs
cStarts[1] <- 1L
cEnds[1]   <- 18L

cNames[2]  <- "logMHI" # Log hydrogen mass
cStarts[2] <- 78L
cEnds[2]   <- 82L

and then read

myCatalogue <- read_fwf(input,
col_positions = fwf_positions(cStart, cEnd),
col_names = cNames)

It may take you longer to type this up, but the chance of getting it right is 
much higher, and most importantly, it's crystal clear, explicit, extensible, 
and your summer student will see exactly what's going on here.

I hope this is helpful, and it comes closer to "how to deal with fixed-width 
files".
Cheers,

Boris

> On Oct 5, 2017, at 12:46 PM, jean-philippe 
>  wrote:
> 
> dear Jim,
> 
> 
> Yes I fixed the problem. Thanks again all of you for your contribution!
> This worked :
> 
> start <- c(1, 20, 35, 41, 44, 48, 53, 59, 64, 70, 76, 78, 83, 88,
>   +93, 114, 122, 127)
> data1<-read_fwf("lvg_table2.txt",skip=70, fwf_widths(diff(start)))
> 
> Well now I know how to deal with fixed-width files :)
> 
> 
> Cheers
> 
> 
> Jean-Philippe
> 
> On 05/10/2017 18:42, jim holtman wrote:
>> You should be able to use that header information to create the
>> correct parameters to the read_fwf function to read in the data.
>> 
>> Jim Holtman
>> Data Munger Guru
>> 
>> What is the problem that you are trying to solve?
>> Tell me what you want to do, not how you want to do it.
>> 
>> 
>> On Thu, Oct 5, 2017 at 11:02 AM, jean-philippe
>>  wrote:
>>> dear Jim,
>>> 
>>> Thanks for your reply and your proposition.
>>> 
>>> I forgot to provide the header of the dataframe, here it is:
>>> 
>>> Byte-by-byte Description of file: lvg_table2.dat
>>> 
>>>Bytes Format Units   Label   Explanations
>>> 
>>>1- 18 A18--- NameGalaxy name in well-known catalogs
>>>   20- 21 I2 h   RAh Hour of Right Ascension (J2000)
>>>   22- 23 I2 min RAm Minute of Right Ascension (J2000)
>>>   24- 27 F4.1   s   RAs Second of Right Ascension (J2000)
>>>   28 A1 --- DE- Sign of the Declination (J2000)
>>>   29- 30 I2 deg DEd Degree of Declination (J2000)
>>>   31- 32 I2 arcmin  DEm Arcminute of Declination (J2000)
>>>   33- 34 I2 arcsec  DEs Arcsecond of Declination (J2000)
>>>   36- 40 F5.2   kpc a26 ? Major linear diameter (1)
>>>   42- 43 I2 deg inc ? Inclination
>>>   45- 47 I3 km/sVm  ? Amplitude of rotational velocity (2)
>>>   49- 52 F4.2   mag AB  ? Internal B band extinction (3)
>>>   54- 58 F5.1   mag BMag? Absolute B band magnitude (4)
>>>   60- 63 F4.1   mag/arcsec2 SBB ? Average B band surface brightness (5)
>>>   65- 69 F5.2   [solLum]logKLum ? Log

Re: [R] dealing with a messy dataset

2017-10-05 Thread jean-philippe


dear Jim,


Yes I fixed the problem. Thanks again all of you for your contribution!
This worked :

start <- c(1, 20, 35, 41, 44, 48, 53, 59, 64, 70, 76, 78, 83, 88,
   +93, 114, 122, 127)
data1<-read_fwf("lvg_table2.txt",skip=70, fwf_widths(diff(start)))

Well now I know how to deal with fixed-width files :)


Cheers


Jean-Philippe

On 05/10/2017 18:42, jim holtman wrote:

You should be able to use that header information to create the
correct parameters to the read_fwf function to read in the data.

Jim Holtman
Data Munger Guru

What is the problem that you are trying to solve?
Tell me what you want to do, not how you want to do it.


On Thu, Oct 5, 2017 at 11:02 AM, jean-philippe
 wrote:

dear Jim,

Thanks for your reply and your proposition.

I forgot to provide the header of the dataframe, here it is:

Byte-by-byte Description of file: lvg_table2.dat

Bytes Format Units   Label   Explanations

1- 18 A18--- NameGalaxy name in well-known catalogs
   20- 21 I2 h   RAh Hour of Right Ascension (J2000)
   22- 23 I2 min RAm Minute of Right Ascension (J2000)
   24- 27 F4.1   s   RAs Second of Right Ascension (J2000)
   28 A1 --- DE- Sign of the Declination (J2000)
   29- 30 I2 deg DEd Degree of Declination (J2000)
   31- 32 I2 arcmin  DEm Arcminute of Declination (J2000)
   33- 34 I2 arcsec  DEs Arcsecond of Declination (J2000)
   36- 40 F5.2   kpc a26 ? Major linear diameter (1)
   42- 43 I2 deg inc ? Inclination
   45- 47 I3 km/sVm  ? Amplitude of rotational velocity (2)
   49- 52 F4.2   mag AB  ? Internal B band extinction (3)
   54- 58 F5.1   mag BMag? Absolute B band magnitude (4)
   60- 63 F4.1   mag/arcsec2 SBB ? Average B band surface brightness (5)
   65- 69 F5.2   [solLum]logKLum ? Log K_S_ band luminosity (6)
   71- 75 F5.2   [solMass]   logM26  ? Log mass within Holmberg radius (7)
   77 A1 ---   l_logMHI  Limit flag on logMHI
   78- 82 F5.2   [solMass]   logMHI  ? Log hydrogen mass (8)
   84- 87 I4 km/sVLG ? Radial velocity (9)
   89- 92 F4.1   --- Theta1  ? Tidal index (10)
   94-116 A23--- MD  Main disturber name (11)
  118-121 F4.1   --- Theta5  ? Another tidal index (12)
  123-127 F5.2   [-] Thetaj  ? Log K band luminosity density (13)


The idea for me is to select only the galaxy name and the logMHI values for
these galaxies, so quite a simple job when the dataset is tidy enough. I was
thinking as usual to use select from dplyr.
That is why I was just asking how to read this kind of files which, for me
so far, are uncommon.

Doing what you propose, it formats most of the columns correctly except few
ones, I will see how I can change some width to get it correctly:

   X1  X2X3X4X5X6X7X8 X9X10
X11   X12   X13   X14  X15   X16 X17
(chr)   (chr) (dbl) (int) (dbl) (dbl) (chr) (dbl) (chr)
(chr) (int) (chr) (chr) (chr)(chr) (dbl)   (chr)
1   UGC12894 22.5+392944  2.783321 0 -13.3  25.2 7.5 8  8.1
7   7.9 2  61 9 -1.3 NGC7640-1 0  0.12
2WLM 000158.1-152740  3.259022 0 -14.1 24.8 7.7 0 8.2
7   7.8 4  -1 6  0. 0 MESSIER031 0 2  1.75
3  And XVIII 000214.5+450520  0.6917 9 0  -8.7  26.8 6.4 4  6.7
8 < 6.6 5  -4 4  0. 5 MESSIER031 0 6  1.54
4  PAndAS-03 000356.4+405319  0.1017NA 0  -3.6  27.8 4.3  8
NANANA2. 8 MESSIER031 2 8  1.75
5  PAndAS-04 000442.9+472142  0.0522NA 0  -6.6  23.1 5.5  9
NANA   -10 8  2. 5 MESSIER031 2 5  1.75
6  PAndAS-05 000524.1+435535  0.0631NA 0  -4.5  25.6 4.7  5
NANA10 3  2. 8 MESSIER031 2 8  1.75
7 ESO409-015 000531.8-280553  3.007823 0 -14.6  24.1 8.1 0  8.2
5   8.1 0  76 9 -2.0 NGC0024-1 5 -2.05
8  AGC748778 000634.4+153039 0.61 70 3 0 -10.4  24.9 6.3 9  5.7
0   6.6 4  48 6 -1.9 NGC0253-1 5 -2.72
9 And XX 000730.7+350756  0.2033 5 0  -5.8  27.1 5.2 6  5.7
0NA   -18 2  2. 4 MESSIER031 2 4  1.75


Cheers, thanks again


Jean-Philippe
On 05/10/2017 16:49, jim holtman wrote:

start <- c(1, 20, 35, 41, 44, 48, 53, 59, 64, 69, 75, 77, 82, 87,
  +92, 114, 121, 127)
  > read_fwf(input, fwf_widths(diff(start)))


--
Jean-Philippe Fontaine
PhD Student in Astroparticle Physics,
Gran Sasso Science Institute (GSSI),
Viale Francesco Crispi 7,
67100 L'Aquila, Italy
Mobile: +393

Re: [R] dealing with a messy dataset

2017-10-05 Thread jim holtman

You should be able to use that header information to create the
correct parameters to the read_fwf function to read in the data.

Jim Holtman
Data Munger Guru

What is the problem that you are trying to solve?
Tell me what you want to do, not how you want to do it.


On Thu, Oct 5, 2017 at 11:02 AM, jean-philippe
 wrote:
> dear Jim,
>
> Thanks for your reply and your proposition.
>
> I forgot to provide the header of the dataframe, here it is:
> 
> Byte-by-byte Description of file: lvg_table2.dat
> 
>Bytes Format Units   Label   Explanations
> 
>1- 18 A18--- NameGalaxy name in well-known catalogs
>   20- 21 I2 h   RAh Hour of Right Ascension (J2000)
>   22- 23 I2 min RAm Minute of Right Ascension (J2000)
>   24- 27 F4.1   s   RAs Second of Right Ascension (J2000)
>   28 A1 --- DE- Sign of the Declination (J2000)
>   29- 30 I2 deg DEd Degree of Declination (J2000)
>   31- 32 I2 arcmin  DEm Arcminute of Declination (J2000)
>   33- 34 I2 arcsec  DEs Arcsecond of Declination (J2000)
>   36- 40 F5.2   kpc a26 ? Major linear diameter (1)
>   42- 43 I2 deg inc ? Inclination
>   45- 47 I3 km/sVm  ? Amplitude of rotational velocity (2)
>   49- 52 F4.2   mag AB  ? Internal B band extinction (3)
>   54- 58 F5.1   mag BMag? Absolute B band magnitude (4)
>   60- 63 F4.1   mag/arcsec2 SBB ? Average B band surface brightness (5)
>   65- 69 F5.2   [solLum]logKLum ? Log K_S_ band luminosity (6)
>   71- 75 F5.2   [solMass]   logM26  ? Log mass within Holmberg radius (7)
>   77 A1 ---   l_logMHI  Limit flag on logMHI
>   78- 82 F5.2   [solMass]   logMHI  ? Log hydrogen mass (8)
>   84- 87 I4 km/sVLG ? Radial velocity (9)
>   89- 92 F4.1   --- Theta1  ? Tidal index (10)
>   94-116 A23--- MD  Main disturber name (11)
>  118-121 F4.1   --- Theta5  ? Another tidal index (12)
>  123-127 F5.2   [-] Thetaj  ? Log K band luminosity density (13)
> 
>
> The idea for me is to select only the galaxy name and the logMHI values for
> these galaxies, so quite a simple job when the dataset is tidy enough. I was
> thinking as usual to use select from dplyr.
> That is why I was just asking how to read this kind of files which, for me
> so far, are uncommon.
>
> Doing what you propose, it formats most of the columns correctly except few
> ones, I will see how I can change some width to get it correctly:
>
>   X1  X2X3X4X5X6X7X8 X9X10
> X11   X12   X13   X14  X15   X16 X17
>(chr)   (chr) (dbl) (int) (dbl) (dbl) (chr) (dbl) (chr)
> (chr) (int) (chr) (chr) (chr)(chr) (dbl)   (chr)
> 1   UGC12894 22.5+392944  2.783321 0 -13.3  25.2 7.5 8  8.1
> 7   7.9 2  61 9 -1.3 NGC7640-1 0  0.12
> 2WLM 000158.1-152740  3.259022 0 -14.1 24.8 7.7 0 8.2
> 7   7.8 4  -1 6  0. 0 MESSIER031 0 2  1.75
> 3  And XVIII 000214.5+450520  0.6917 9 0  -8.7  26.8 6.4 4  6.7
> 8 < 6.6 5  -4 4  0. 5 MESSIER031 0 6  1.54
> 4  PAndAS-03 000356.4+405319  0.1017NA 0  -3.6  27.8 4.3  8
> NANANA2. 8 MESSIER031 2 8  1.75
> 5  PAndAS-04 000442.9+472142  0.0522NA 0  -6.6  23.1 5.5  9
> NANA   -10 8  2. 5 MESSIER031 2 5  1.75
> 6  PAndAS-05 000524.1+435535  0.0631NA 0  -4.5  25.6 4.7  5
> NANA10 3  2. 8 MESSIER031 2 8  1.75
> 7 ESO409-015 000531.8-280553  3.007823 0 -14.6  24.1 8.1 0  8.2
> 5   8.1 0  76 9 -2.0 NGC0024-1 5 -2.05
> 8  AGC748778 000634.4+153039 0.61 70 3 0 -10.4  24.9 6.3 9  5.7
> 0   6.6 4  48 6 -1.9 NGC0253-1 5 -2.72
> 9 And XX 000730.7+350756  0.2033 5 0  -5.8  27.1 5.2 6  5.7
> 0NA   -18 2  2. 4 MESSIER031 2 4  1.75
>
>
> Cheers, thanks again
>
>
> Jean-Philippe
> On 05/10/2017 16:49, jim holtman wrote:
>>
>> start <- c(1, 20, 35, 41, 44, 48, 53, 59, 64, 69, 75, 77, 82, 87,
>>  +92, 114, 121, 127)
>>  > read_fwf(input, fwf_widths(diff(start)))
>
>
> --
> Jean-Philippe Fontaine
> PhD Student in Astroparticle Physics,
> Gran Sasso Science Institute (GSSI),
> Viale Francesco Crispi 7,
> 67100 L'Aquila, Italy
> Mobile: +393487128593, +33615653774
>

__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
a

Re: [R] dealing with a messy dataset

2017-10-05 Thread Boris Steipe

Since you have an authoritative description of the format, by all means use 
that - not a guess based on a visual inspection of where data appears in a 
sample row.

B. 




> On Oct 5, 2017, at 11:02 AM, jean-philippe 
>  wrote:
> 
> dear Jim,
> 
> Thanks for your reply and your proposition.
> 
> I forgot to provide the header of the dataframe, here it is:
> 
> Byte-by-byte Description of file: lvg_table2.dat
> 
>   Bytes Format Units   Label   Explanations
> 
>   1- 18 A18--- NameGalaxy name in well-known catalogs
>  20- 21 I2 h   RAh Hour of Right Ascension (J2000)
>  22- 23 I2 min RAm Minute of Right Ascension (J2000)
>  24- 27 F4.1   s   RAs Second of Right Ascension (J2000)
>  28 A1 --- DE- Sign of the Declination (J2000)
>  29- 30 I2 deg DEd Degree of Declination (J2000)
>  31- 32 I2 arcmin  DEm Arcminute of Declination (J2000)
>  33- 34 I2 arcsec  DEs Arcsecond of Declination (J2000)
>  36- 40 F5.2   kpc a26 ? Major linear diameter (1)
>  42- 43 I2 deg inc ? Inclination
>  45- 47 I3 km/sVm  ? Amplitude of rotational velocity (2)
>  49- 52 F4.2   mag AB  ? Internal B band extinction (3)
>  54- 58 F5.1   mag BMag? Absolute B band magnitude (4)
>  60- 63 F4.1   mag/arcsec2 SBB ? Average B band surface brightness (5)
>  65- 69 F5.2   [solLum]logKLum ? Log K_S_ band luminosity (6)
>  71- 75 F5.2   [solMass]   logM26  ? Log mass within Holmberg radius (7)
>  77 A1 ---   l_logMHI  Limit flag on logMHI
>  78- 82 F5.2   [solMass]   logMHI  ? Log hydrogen mass (8)
>  84- 87 I4 km/sVLG ? Radial velocity (9)
>  89- 92 F4.1   --- Theta1  ? Tidal index (10)
>  94-116 A23--- MD  Main disturber name (11)
> 118-121 F4.1   --- Theta5  ? Another tidal index (12)
> 123-127 F5.2   [-] Thetaj  ? Log K band luminosity density (13)
> 
> 
> The idea for me is to select only the galaxy name and the logMHI values for 
> these galaxies, so quite a simple job when the dataset is tidy enough. I was 
> thinking as usual to use select from dplyr.
> That is why I was just asking how to read this kind of files which, for me so 
> far, are uncommon.
> 
> Doing what you propose, it formats most of the columns correctly except few 
> ones, I will see how I can change some width to get it correctly:
> 
>  X1  X2X3X4X5X6X7X8 X9X10   
> X11   X12   X13   X14  X15   X16 X17
>   (chr)   (chr) (dbl) (int) (dbl) (dbl) (chr) (dbl) (chr)  (chr) 
> (int) (chr) (chr) (chr)(chr) (dbl)   (chr)
> 1   UGC12894 22.5+392944  2.783321 0 -13.3  25.2 7.5 8  8.1   
>   7   7.9 2  61 9 -1.3 NGC7640-1 0  0.12
> 2WLM 000158.1-152740  3.259022 0 -14.1  24.8 7.7 0  8.2   
>   7   7.8 4  -1 6  0. 0 MESSIER031 0 2  1.75
> 3  And XVIII 000214.5+450520  0.6917 9 0  -8.7  26.8 6.4 4  6.7   
>   8 < 6.6 5  -4 4  0. 5 MESSIER031 0 6  1.54
> 4  PAndAS-03 000356.4+405319  0.1017NA 0  -3.6  27.8 4.3  8   
>  NANANA2. 8 MESSIER031 2 8  1.75
> 5  PAndAS-04 000442.9+472142  0.0522NA 0  -6.6  23.1 5.5  9   
>  NANA   -10 8  2. 5 MESSIER031 2 5  1.75
> 6  PAndAS-05 000524.1+435535  0.0631NA 0  -4.5  25.6 4.7  5   
>  NANA10 3  2. 8 MESSIER031 2 8  1.75
> 7 ESO409-015 000531.8-280553  3.007823 0 -14.6  24.1 8.1 0  8.2   
>   5   8.1 0  76 9 -2.0 NGC0024-1 5 -2.05
> 8  AGC748778 000634.4+153039  0.6170 3 0 -10.4  24.9 6.3 9  5.7   
>   0   6.6 4  48 6 -1.9 NGC0253-1 5 -2.72
> 9 And XX 000730.7+350756  0.2033 5 0  -5.8  27.1 5.2 6  5.7   
>   0NA   -18 2  2. 4 MESSIER031 2 4  1.75
> 
> 
> Cheers, thanks again
> 
> 
> Jean-Philippe
> On 05/10/2017 16:49, jim holtman wrote:
>> start <- c(1, 20, 35, 41, 44, 48, 53, 59, 64, 69, 75, 77, 82, 87,
>> +92, 114, 121, 127)
>> > read_fwf(input, fwf_widths(diff(start)))
> 
> -- 
> Jean-Philippe Fontaine
> PhD Student in Astroparticle Physics,
> Gran Sasso Science Institute (GSSI),
> Viale Francesco Crispi 7,
> 67100 L'Aquila, Italy
> Mobile: +393487128593, +33615653774
> 
> __
> R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained

Re: [R] dealing with a messy dataset

2017-10-05 Thread jean-philippe


dear Jim,

Thanks for your reply and your proposition.

I forgot to provide the header of the dataframe, here it is:

Byte-by-byte Description of file: lvg_table2.dat

   Bytes Format Units   Label   Explanations

   1- 18 A18--- NameGalaxy name in well-known catalogs
  20- 21 I2 h   RAh Hour of Right Ascension (J2000)
  22- 23 I2 min RAm Minute of Right Ascension (J2000)
  24- 27 F4.1   s   RAs Second of Right Ascension (J2000)
  28 A1 --- DE- Sign of the Declination (J2000)
  29- 30 I2 deg DEd Degree of Declination (J2000)
  31- 32 I2 arcmin  DEm Arcminute of Declination (J2000)
  33- 34 I2 arcsec  DEs Arcsecond of Declination (J2000)
  36- 40 F5.2   kpc a26 ? Major linear diameter (1)
  42- 43 I2 deg inc ? Inclination
  45- 47 I3 km/sVm  ? Amplitude of rotational velocity (2)
  49- 52 F4.2   mag AB  ? Internal B band extinction (3)
  54- 58 F5.1   mag BMag? Absolute B band magnitude (4)
  60- 63 F4.1   mag/arcsec2 SBB ? Average B band surface brightness (5)
  65- 69 F5.2   [solLum]logKLum ? Log K_S_ band luminosity (6)
  71- 75 F5.2   [solMass]   logM26  ? Log mass within Holmberg radius (7)
  77 A1 ---   l_logMHI  Limit flag on logMHI
  78- 82 F5.2   [solMass]   logMHI  ? Log hydrogen mass (8)
  84- 87 I4 km/sVLG ? Radial velocity (9)
  89- 92 F4.1   --- Theta1  ? Tidal index (10)
  94-116 A23--- MD  Main disturber name (11)
 118-121 F4.1   --- Theta5  ? Another tidal index (12)
 123-127 F5.2   [-] Thetaj  ? Log K band luminosity density (13)


The idea for me is to select only the galaxy name and the logMHI values 
for these galaxies, so quite a simple job when the dataset is tidy 
enough. I was thinking as usual to use select from dplyr.
That is why I was just asking how to read this kind of files which, for 
me so far, are uncommon.


Doing what you propose, it formats most of the columns correctly except 
few ones, I will see how I can change some width to get it correctly:


  X1  X2X3X4X5X6X7X8 X9
X10   X11   X12   X13   X14  X15   X16 X17
   (chr)   (chr) (dbl) (int) (dbl) (dbl) (chr) (dbl) (chr)  
(chr) (int) (chr) (chr) (chr)(chr) (dbl)   (chr)
1   UGC12894 22.5+392944  2.783321 0 -13.3  25.2 7.5 8  
8.1 7   7.9 2  61 9 -1.3 NGC7640-1 0  0.12
2WLM 000158.1-152740  3.259022 0 -14.1  24.8 7.7 0  
8.2 7   7.8 4  -1 6  0. 0 MESSIER031 0 2  1.75
3  And XVIII 000214.5+450520  0.6917 9 0  -8.7  26.8 6.4 4  
6.7 8 < 6.6 5  -4 4  0. 5 MESSIER031 0 6  1.54
4  PAndAS-03 000356.4+405319  0.1017NA 0  -3.6  27.8 
4.3  8NANANA2. 8 MESSIER031 2 8  1.75
5  PAndAS-04 000442.9+472142  0.0522NA 0  -6.6  23.1 
5.5  9NANA   -10 8  2. 5 MESSIER031 2 5  1.75
6  PAndAS-05 000524.1+435535  0.0631NA 0  -4.5  25.6 
4.7  5NANA10 3  2. 8 MESSIER031 2 8  1.75
7 ESO409-015 000531.8-280553  3.007823 0 -14.6  24.1 8.1 0  
8.2 5   8.1 0  76 9 -2.0 NGC0024-1 5 -2.05
8  AGC748778 000634.4+153039  0.6170 3 0 -10.4  24.9 6.3 9  
5.7 0   6.6 4  48 6 -1.9 NGC0253-1 5 -2.72
9 And XX 000730.7+350756  0.2033 5 0  -5.8  27.1 5.2 6  
5.7 0NA   -18 2  2. 4 MESSIER031 2 4  1.75



Cheers, thanks again


Jean-Philippe
On 05/10/2017 16:49, jim holtman wrote:

start <- c(1, 20, 35, 41, 44, 48, 53, 59, 64, 69, 75, 77, 82, 87,
 +92, 114, 121, 127)
 > read_fwf(input, fwf_widths(diff(start)))


--
Jean-Philippe Fontaine
PhD Student in Astroparticle Physics,
Gran Sasso Science Institute (GSSI),
Viale Francesco Crispi 7,
67100 L'Aquila, Italy
Mobile: +393487128593, +33615653774

__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Re: [R] dealing with a messy dataset

2017-10-05 Thread jim holtman

It looks like fixed width.  I just used the last position of each
field to get the size and used the 'readr' package;

> input <- "And XVIII  000214.5+450520  0.69 17   9 0.00
-8.7 26.8 6.44  6.78 < 6.65  -44  0.5 MESSIER031   0.6
1.54
+ PAndAS-03  000356.4+405319  0.10 17 0.00  -3.6 27.8
4.382.8 MESSIER031   2.8  1.75
+ PAndAS-04  000442.9+472142  0.05 22 0.00  -6.6 23.1
5.59  -108  2.5 MESSIER031   2.5  1.75
+ PAndAS-05  000524.1+435535  0.06 31 0.00  -4.5 25.6
4.75   103  2.8 MESSIER031   2.8  1.75
+ ESO409-015 000531.8-280553  3.00 78  23 0.00 -14.6 24.1
8.10  8.25   8.10  769 -2.0 NGC0024 -1.5 -2.05
+ AGC748778  000634.4+153039  0.61 70   3 0.00 -10.4 24.9
6.39  5.70   6.64  486 -1.9 NGC0253 -1.5 -2.72
+ And XX 000730.7+350756  0.20 33   5 0.00  -5.8 27.1
5.26  5.70-182  2.4 MESSIER031   2.4  1.75"
>
> start <- c(1, 20, 35, 41, 44, 48, 53, 59, 64, 69, 75, 77, 82, 87,
+92, 114, 121, 127)
> read_fwf(input, fwf_widths(diff(start)))
# A tibble: 7 x 17
  X1  X2X3X4X5X6X7X8
 X9   X10   X11   X12   X13   X14

 
1  And XVIII 000214.5+450520  0.6917 9 0  -8.7  26.8
6.44  6.78 <  6.65   -44   0.5
2  PAndAS-03 000356.4+405319  0.1017NA 0  -3.6  27.8
4.38NA  NANA   2.8
3  PAndAS-04 000442.9+472142  0.0522NA 0  -6.6  23.1
5.59NA  NA  -108   2.5
4  PAndAS-05 000524.1+435535  0.0631NA 0  -4.5  25.6
4.75NA  NA   103   2.8
5 ESO409-015 000531.8-280553  3.007823 0 -14.6  24.1
8.10  8.258.10   769  -2.0
6  AGC748778 000634.4+153039  0.6170 3 0 -10.4  24.9
6.39  5.706.64   486  -1.9
7 And XX 000730.7+350756  0.2033 5 0  -5.8  27.1
5.26  5.70  NA  -182   2.4
# ... with 3 more variables: X15 , X16 , X17 
>


Jim Holtman
Data Munger Guru

What is the problem that you are trying to solve?
Tell me what you want to do, not how you want to do it.


On Thu, Oct 5, 2017 at 10:12 AM, jean-philippe
 wrote:
> dear R-users,
>
>
> I am facing a quite regular and basic problem when it comes to dealing with
> datasets, but I cannot find any satisfying answer so far.
> I have a messy dataset of galaxies like that :
>
> And XVIII  000214.5+450520  0.69 17   9 0.00  -8.7 26.8 6.44  6.78 <
> 6.65  -44  0.5 MESSIER031   0.6  1.54
> PAndAS-03  000356.4+405319  0.10 17 0.00  -3.6 27.8 4.38
> 2.8 MESSIER031   2.8  1.75
> PAndAS-04  000442.9+472142  0.05 22 0.00  -6.6 23.1 5.59
> -108  2.5 MESSIER031   2.5  1.75
> PAndAS-05  000524.1+435535  0.06 31 0.00  -4.5 25.6 4.75
> 103  2.8 MESSIER031   2.8  1.75
> ESO409-015 000531.8-280553  3.00 78  23 0.00 -14.6 24.1 8.10  8.25
> 8.10  769 -2.0 NGC0024 -1.5 -2.05
> AGC748778  000634.4+153039  0.61 70   3 0.00 -10.4 24.9 6.39  5.70
> 6.64  486 -1.9 NGC0253 -1.5 -2.72
> And XX 000730.7+350756  0.20 33   5 0.00  -5.8 27.1 5.26  5.70
> -182  2.4 MESSIER031   2.4  1.75
>
> What I would like to do is to read this dataset, but I would like that the
> space between And and XVIII is not interpreted as 2 different columns but as
> the name of the galaxy in one column.
> How is it possible to do so?
>
> For instance I did this data1<-read.table("lvg_table2.txt",skip=70,fill=T)
> where I used fill=T because the rows don't have the same number of features
> since R splits the name of the galaxies into 2 columns because of the space.
>
>
> Best Regards, thanks in advance
>
>
> Jean-Philippe Fontaine
>
> --
> Jean-Philippe Fontaine
> PhD Student in Astroparticle Physics,
> Gran Sasso Science Institute (GSSI),
> Viale Francesco Crispi 7,
> 67100 L'Aquila, Italy
> Mobile: +393487128593, +33615653774
>
> __
> R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.

__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Re: [R] dealing with a messy dataset

2017-10-05 Thread jean-philippe


dear Boris,

Thanks for your answer!

Yes it seems to  be a fixed-width format. I didn't remember this type of 
datasets since I am not used to analyze and process them.


Thanks anyway, it seems to fix the problem (I just need to reflect a bit 
more on the width of each feature)!



Cheers

Jean-Philippe

On 05/10/2017 16:19, Boris Steipe wrote:

Is this a fixed width format?
If so, read.fwf() in base, or read_fwf() in the readr package will solve the 
problem. You may need to trim trailing spaces though.


B.





On Oct 5, 2017, at 10:12 AM, jean-philippe  
wrote:

dear R-users,


I am facing a quite regular and basic problem when it comes to dealing with 
datasets, but I cannot find any satisfying answer so far.
I have a messy dataset of galaxies like that :

And XVIII  000214.5+450520  0.69 17   9 0.00  -8.7 26.8 6.44  6.78 < 
6.65  -44  0.5 MESSIER031   0.6  1.54
PAndAS-03  000356.4+405319  0.10 17 0.00  -3.6 27.8 4.38
2.8 MESSIER031   2.8  1.75
PAndAS-04  000442.9+472142  0.05 22 0.00  -6.6 23.1 5.59
  -108  2.5 MESSIER031   2.5  1.75
PAndAS-05  000524.1+435535  0.06 31 0.00  -4.5 25.6 4.75
   103  2.8 MESSIER031   2.8  1.75
ESO409-015 000531.8-280553  3.00 78  23 0.00 -14.6 24.1 8.10  8.25   
8.10  769 -2.0 NGC0024 -1.5 -2.05
AGC748778  000634.4+153039  0.61 70   3 0.00 -10.4 24.9 6.39  5.70   
6.64  486 -1.9 NGC0253 -1.5 -2.72
And XX 000730.7+350756  0.20 33   5 0.00  -5.8 27.1 5.26  5.70  
  -182  2.4 MESSIER031   2.4  1.75

What I would like to do is to read this dataset, but I would like that the 
space between And and XVIII is not interpreted as 2 different columns but as 
the name of the galaxy in one column.
How is it possible to do so?

For instance I did this data1<-read.table("lvg_table2.txt",skip=70,fill=T) 
where I used fill=T because the rows don't have the same number of features since R splits 
the name of the galaxies into 2 columns because of the space.


Best Regards, thanks in advance


Jean-Philippe Fontaine

--
Jean-Philippe Fontaine
PhD Student in Astroparticle Physics,
Gran Sasso Science Institute (GSSI),
Viale Francesco Crispi 7,
67100 L'Aquila, Italy
Mobile: +393487128593, +33615653774

__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


--
Jean-Philippe Fontaine
PhD Student in Astroparticle Physics,
Gran Sasso Science Institute (GSSI),
Viale Francesco Crispi 7,
67100 L'Aquila, Italy
Mobile: +393487128593, +33615653774

__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Re: [R] dealing with a messy dataset

2017-10-05 Thread Boris Steipe

Is this a fixed width format?
If so, read.fwf() in base, or read_fwf() in the readr package will solve the 
problem. You may need to trim trailing spaces though.


B.




> On Oct 5, 2017, at 10:12 AM, jean-philippe 
>  wrote:
> 
> dear R-users,
> 
> 
> I am facing a quite regular and basic problem when it comes to dealing with 
> datasets, but I cannot find any satisfying answer so far.
> I have a messy dataset of galaxies like that :
> 
> And XVIII  000214.5+450520  0.69 17   9 0.00  -8.7 26.8 6.44  6.78 < 
> 6.65  -44  0.5 MESSIER031   0.6  1.54
> PAndAS-03  000356.4+405319  0.10 17 0.00  -3.6 27.8 4.38  
>   2.8 MESSIER031   2.8  1.75
> PAndAS-04  000442.9+472142  0.05 22 0.00  -6.6 23.1 5.59  
> -108  2.5 MESSIER031   2.5  1.75
> PAndAS-05  000524.1+435535  0.06 31 0.00  -4.5 25.6 4.75  
>  103  2.8 MESSIER031   2.8  1.75
> ESO409-015 000531.8-280553  3.00 78  23 0.00 -14.6 24.1 8.10  8.25   
> 8.10  769 -2.0 NGC0024 -1.5 -2.05
> AGC748778  000634.4+153039  0.61 70   3 0.00 -10.4 24.9 6.39  5.70   
> 6.64  486 -1.9 NGC0253 -1.5 -2.72
> And XX 000730.7+350756  0.20 33   5 0.00  -5.8 27.1 5.26  5.70
> -182  2.4 MESSIER031   2.4  1.75
> 
> What I would like to do is to read this dataset, but I would like that the 
> space between And and XVIII is not interpreted as 2 different columns but as 
> the name of the galaxy in one column.
> How is it possible to do so?
> 
> For instance I did this data1<-read.table("lvg_table2.txt",skip=70,fill=T) 
> where I used fill=T because the rows don't have the same number of features 
> since R splits the name of the galaxies into 2 columns because of the space.
> 
> 
> Best Regards, thanks in advance
> 
> 
> Jean-Philippe Fontaine
> 
> -- 
> Jean-Philippe Fontaine
> PhD Student in Astroparticle Physics,
> Gran Sasso Science Institute (GSSI),
> Viale Francesco Crispi 7,
> 67100 L'Aquila, Italy
> Mobile: +393487128593, +33615653774
> 
> __
> R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.

__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

[R] dealing with a messy dataset

2017-10-05 Thread jean-philippe


dear R-users,


I am facing a quite regular and basic problem when it comes to dealing 
with datasets, but I cannot find any satisfying answer so far.

I have a messy dataset of galaxies like that :

And XVIII  000214.5+450520  0.69 17   9 0.00  -8.7 26.8 6.44  
6.78 < 6.65  -44  0.5 MESSIER031   0.6  1.54
PAndAS-03  000356.4+405319  0.10 17 0.00  -3.6 27.8 
4.382.8 MESSIER031   2.8  1.75
PAndAS-04  000442.9+472142  0.05 22 0.00  -6.6 23.1 
5.59  -108  2.5 MESSIER031   2.5  1.75
PAndAS-05  000524.1+435535  0.06 31 0.00  -4.5 25.6 
4.75   103  2.8 MESSIER031   2.8  1.75
ESO409-015 000531.8-280553  3.00 78  23 0.00 -14.6 24.1 8.10  
8.25   8.10  769 -2.0 NGC0024 -1.5 -2.05
AGC748778  000634.4+153039  0.61 70   3 0.00 -10.4 24.9 6.39  
5.70   6.64  486 -1.9 NGC0253 -1.5 -2.72
And XX 000730.7+350756  0.20 33   5 0.00  -5.8 27.1 5.26  
5.70-182  2.4 MESSIER031   2.4  1.75


What I would like to do is to read this dataset, but I would like that 
the space between And and XVIII is not interpreted as 2 different 
columns but as the name of the galaxy in one column.

How is it possible to do so?

For instance I did this 
data1<-read.table("lvg_table2.txt",skip=70,fill=T) where I used fill=T 
because the rows don't have the same number of features since R splits 
the name of the galaxies into 2 columns because of the space.



Best Regards, thanks in advance


Jean-Philippe Fontaine

--
Jean-Philippe Fontaine
PhD Student in Astroparticle Physics,
Gran Sasso Science Institute (GSSI),
Viale Francesco Crispi 7,
67100 L'Aquila, Italy
Mobile: +393487128593, +33615653774

__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Re: [R] dealing with a messy dataset

Re: [R] dealing with a messy dataset

Re: [R] dealing with a messy dataset

Re: [R] dealing with a messy dataset

Re: [R] dealing with a messy dataset

Re: [R] dealing with a messy dataset

Re: [R] dealing with a messy dataset

Re: [R] dealing with a messy dataset

[R] dealing with a messy dataset

9 matches

Site Navigation

Mail list logo

Footer information