Re: [PLUG] Correcting duplicate strings in files

2018-06-20 Thread Rich Shepard

On Tue, 19 Jun 2018, david wrote:


cat $file | uniq -u > $outfile


David,

  The above prints only the unique lines. I, too, have used this after grep
to remove duplicates.

  With the data referenced in this thread uniq will not do the job because
each line is either unique as a whole (same date, same hour, different
value) or a complete duplicate (same date, same hour, same value). The
former removes nothing; the latter removes the observation for 5:00 pm.

  The need is to replace only the second duplicated 16:00 hour with 17:00.
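A minimal Python sketch (with hypothetical sample rows) of why whole-line deduplication cannot fix this data: emulating `uniq -u` drops both copies of the 16:00 row instead of relabeling the second one.

```python
# Hypothetical sample rows; the second 16:00 should have been 17:00.
rows = [
    "2012-10-01,15:00,90.8121",
    "2012-10-01,16:00,90.8121",
    "2012-10-01,16:00,90.8121",
    "2012-10-01,18:00,90.8091",
]

# Emulate `uniq -u`: keep only lines that differ from both neighbors.
unique_only = [r for i, r in enumerate(rows)
               if (i == 0 or r != rows[i - 1])
               and (i == len(rows) - 1 or r != rows[i + 1])]
print(unique_only)  # the duplicated observation is gone, not corrected
```

So the data need a positional fix (relabel the second occurrence), not deduplication.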

Regards,

Rich
___
PLUG mailing list
PLUG@pdxlinux.org
http://lists.pdxlinux.org/mailman/listinfo/plug


Re: [PLUG] Correcting duplicate strings in files

2018-06-19 Thread david

On 06/19/2018 06:02 PM, Rich Shepard wrote:

On Tue, 19 Jun 2018, david wrote:

While I believe the answer has already been found, would the 'uniq' 
command have been useful as an alternative?


david,

   Good question. Can it find a difference in a specific field and change
only one of them? Perhaps, but I've no idea.


Without a bigger sample size of data from you, I'm not sure.

I use the uniq command a lot when I pull a list of stuff (usually IPs 
and more) with grep or other utilities from log files and then pipe 
things through uniq to get a count of times an entry is found (-c flag).
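(A sketch of that counting idiom in Python, using made-up IP strings; `collections.Counter` counts globally, whereas `uniq -c` counts adjacent runs and so needs sorted input.)

```python
from collections import Counter

# Made-up entries standing in for strings grepped out of a log file.
entries = ["10.0.0.1", "10.0.0.2", "10.0.0.1", "10.0.0.1"]

counts = Counter(entries)
print(counts.most_common())  # [('10.0.0.1', 3), ('10.0.0.2', 1)]
```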


Provided all data lines are unique except for your one duplicated line, 
then yes, you could use this. A crude but effective way to test 
would be:


cat $file | uniq -u > $outfile

There are a lot of approaches, and I like the awk approach. This might 
just be another tool for you to use in the future to satisfy other needs.


david


Re: [PLUG] Correcting duplicate strings in files

2018-06-19 Thread Rich Shepard

On Tue, 19 Jun 2018, david wrote:

While I believe the answer has already been found, would the 'uniq' command 
have been useful as an alternative?


david,

  Good question. Can it find a difference in a specific field and change
only one of them? Perhaps, but I've no idea.

Thanks,

Rich


Re: [PLUG] Correcting duplicate strings in files [RESOLVED]

2018-06-19 Thread Rich Shepard

On Tue, 19 Jun 2018, Robert Citek wrote:


Awk is a very nice "little" language. Glad to hear it worked. And thanks
for letting us know.


Robert,

  I do a lot of environmental data munging/wrangling/ETL. These come to me as
.xml spreadsheets or the equivalent of line printer output sent as PDF files
(from federal resource agencies). I have found that emacs and awk, with the
occasional use of sed, do the job. Now and then I hit a new requirement
(such as reformatting a date from MM/DD/YY to YYYY-MM-DD) and my awk book
and web searches quickly find a working solution.

  I suspected that awk had flags, but the few web pages (including web fora)
did not use them the way I needed them to work. I've acquired a nice
collection of awk scripts that transform spreadsheet exports so the data can
be used in R, postgres, and GRASS.

Thanks again,

Rich



Re: [PLUG] Correcting duplicate strings in files

2018-06-19 Thread Rich Shepard

On Tue, 19 Jun 2018, Carl Karsten wrote:


It could be done with transistors if you spend enough time ;)


Carl,

  Microprocessors.


I would add some code that verifies assumptions, like
are the dates always the same
is it just that the 1700 rows are 1600?


  Those are hours on the 24-hour clock: 16:00 (4 pm) and 17:00 (5 pm).


anyway, assuming all our descriptions and assumptions are correct,
and the file starts at 2012-10-01,14:00


  Each day starts at 00:00 and runs through 23:00 hour-by-hour.

  The awk script Robert re-did does the job and I corrected all 20 years
where my script error provided two 16:00 hours.

Thanks,

Rich



Re: [PLUG] Correcting duplicate strings in files [RESOLVED]

2018-06-19 Thread Robert Citek
Awk is a very nice "little" language.  Glad to hear it worked.  And
thanks for letting us know. - Robert

On Tue, Jun 19, 2018 at 4:58 PM, Rich Shepard  wrote:
> On Tue, 19 Jun 2018, Robert Citek wrote:
>
>> $2 != "16.00" { print ; next }  <= the decimal should be a colon, 16:00 vs
>> 16.00
>
>
> Robert,
>
>   Oy! Too often we see what we expect to see, not what's actually there. I
> had that in a FORTRAN IV program in the early 1970s.
>
>> flag == 1 && $2 == "16:00" { $2=="17:00"; print; flag=0 ; next } <=
>> equality should be assignment, $2= vs $2==
>
>
>   Ah, I missed that completely, as well as the order of pattern tests.
>
>> Here's a refactored version that you can put in a file:
>>
>> BEGIN {OFS=FS=","} ;
>> flag == 1 && $2 == "16:00" { $2 = "17:00" ; flag = 0 } ;
>> $2 == "16:00" { flag = 1 } ;
>> { print } ;
>
>
>   And it works. Thanks for teaching me a tool that will be applied to other
> awk scripts.
>
>> BTW, in your sample data set the 15:00 and 16:00 entries are identical
>> in the last field.  Is that expected or coincidental?
>
>
>   Expected. This is river stage height data (the elevation of the water
> surface) and it may be constant for a while, or vary fairly regularly. What
> I'm interested in is the pattern cycles: diurnal, seasonal, and annual.
>
> Best regards,
>
> Rich


Re: [PLUG] Correcting duplicate strings in files

2018-06-19 Thread Carl Karsten
It could be done with transistors if you spend enough time ;)

I would add some code that verifies assumptions, like
are the dates always the same
is it just that the 1700 rows are 1600?
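(A hypothetical sketch of such assumption checks: before rewriting anything, confirm each day really has 24 rows, exactly two 16:00 entries, and no 17:00 entry. The helper name and sample data are made up.)

```python
def check_day(rows):
    """rows: one day's 'YYYY-MM-DD,HH:MM,value' strings, in hour order."""
    hours = [r.split(",")[1] for r in rows]
    assert len(rows) == 24, "expected 24 hourly rows per day"
    assert hours.count("16:00") == 2, "expected a duplicated 16:00"
    assert "17:00" not in hours, "17:00 should be the missing hour"

# Exercise it on a fabricated day with the known defect.
day = (["2012-10-01,{:02d}:00,90.8".format(h) for h in range(17)]
       + ["2012-10-01,16:00,90.8"]
       + ["2012-10-01,{:02d}:00,90.8".format(h) for h in range(18, 24)])
check_day(day)  # passes silently
```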

anyway, assuming all our descriptions and assumptions are correct,
and the file starts at 2012-10-01,14:00


import csv
from datetime import datetime, timedelta

year = 2012
file_name = 'observation_{}.csv'.format(year)

start_time = datetime(year,10,1,14)

for h, input_line in enumerate(csv.reader(open(file_name))):
    timestamp = start_time + timedelta(hours=h)
    data_line = "{},{}".format(
        timestamp.strftime("%Y-%m-%d,%H:%M"), input_line[2])
    print(data_line)

carl@twist:~/temp$ python allhours.py
2012-10-01,14:00,90.7999
2012-10-01,15:00,90.8121
2012-10-01,16:00,90.8121
2012-10-01,17:00,90.8121
2012-10-01,18:00,90.8091
2012-10-01,19:00,90.8030

On Tue, Jun 19, 2018 at 6:24 PM, Rich Shepard  wrote:
> On Tue, 19 Jun 2018, Carl Karsten wrote:
>
>> Python will be the easiest to understand.
>> is it always 16:00, or is it any time the whole line is duplicated that
>> the second one's hour should be bumped?
>
>
> Carl,
>
>   The values may differ by hour. It's only the second 16:00 hour each day
> that
> is incorrect.
>
>> also, if you have one line for every hour of the year, how about
>> looping over all those datetimes, paired up with your data, and replace
>> all the datetimes (both good and flawed) with the calculated datetime.
>
>
>   I have everything correct but for the duplicated 4pms.
>
>> Here is 1/2 of it:
>>
>> from datetime import datetime, timedelta
>>
>> for h in range(8760):
>>     timestamp = datetime(2012,1,1) + timedelta(hours=h)
>>     data_line = "{},{}".format(
>>         timestamp.strftime("%Y-%m-%d,%H:%M"),
>>         "123.456")
>>     print(data_line)
>
>
>   Here's my test file (test.dat):
>
> 2012-10-01,14:00,90.7999
> 2012-10-01,15:00,90.8121
> 2012-10-01,16:00,90.8121
> 2012-10-01,16:00,90.8121
> 2012-10-01,18:00,90.8091
> 2012-10-01,19:00,90.8030
>
>   I know it can be done in awk with a flag; but don't know how to do this
> correctly. :-)
>
>
> Thanks,
>
> Rich
>
>



-- 
Carl K


Re: [PLUG] Correcting duplicate strings in files [RESOLVED]

2018-06-19 Thread Rich Shepard

On Tue, 19 Jun 2018, Robert Citek wrote:


$2 != "16.00" { print ; next }  <= the decimal should be a colon, 16:00 vs 16.00


Robert,

  Oy! Too often we see what we expect to see, not what's actually there. I
had that in a FORTRAN IV program in the early 1970s.


flag == 1 && $2 == "16:00" { $2=="17:00"; print; flag=0 ; next } <=
equality should be assignment, $2= vs $2==


  Ah, I missed that completely, as well as the order of pattern tests.


Here's a refactored version that you can put in a file:

BEGIN {OFS=FS=","} ;
flag == 1 && $2 == "16:00" { $2 = "17:00" ; flag = 0 } ;
$2 == "16:00" { flag = 1 } ;
{ print } ;


  And it works. Thanks for teaching me a tool that will be applied to other
awk scripts.


BTW, in your sample data set the 15:00 and 16:00 entries are identical
in the last field.  Is that expected or coincidental?


  Expected. This is river stage height data (the elevation of the water
surface) and it may be constant for a while, or vary fairly regularly. What
I'm interested in is the pattern cycles: diurnal, seasonal, and annual.

Best regards,

Rich


Re: [PLUG] Correcting duplicate strings in files

2018-06-19 Thread Carl Karsten
On Tue, Jun 19, 2018 at 2:29 PM, Rich Shepard  wrote:
> On Tue, 19 Jun 2018, Robert Citek wrote:
>
>> I don't fully understand your question, but here are some examples
>> that may be a step in the right direction:
>
>
> Robert,
>
>   I did not provide as complete an explanation as I should have.
>
>   Each file has 8761 lines, one for each hour of each day during the
> (non-leap) year, plus a header line. It's not just two isolated lines,
> unfortunately.
>
>   I don't follow the logic of your two examples for finding only the
> duplicate 16:00 hours in each day, and changing only the second instance to
> 17:00.
>
>> $ seq 1 5 | sed -e '1~2s/$/ --/'
>> 1 --
>> 2
>> 3 --
>> 4
>> 5 --
>>
>> $ seq 1 5 | sed -e '0~2s/$/ --/'
>> 1
>> 2 --
>> 3
>> 4 --
>> 5
>
>
>   Perhaps I need to write a python script that looks for the string, 16:00,
> and sets a flag the first time that's found. The next time it's found, in
> the following row, the flag is set so the string is changed to 17:00 and the
> flag is unset. Then the script keeps reading until it encounters the next
> day's 16:00 row.
>

Python will be the easiest to understand.

is it always 16:00, or is it any time the whole line is duplicated that
the second one's hour should be bumped?


also, if you have one line for every hour of the year, how about
looping over all those datetimes, paired up with your data, and replacing
all the datetimes (both good and flawed) with the calculated datetime.

Here is 1/2 of it:

from datetime import datetime, timedelta

for h in range(8760):
    timestamp = datetime(2012,1,1) + timedelta(hours=h)
    data_line = "{},{}".format(
        timestamp.strftime("%Y-%m-%d,%H:%M"),
        "123.456")
    print(data_line)


2012-01-01,00:00,123.456
2012-01-01,01:00,123.456
2012-01-01,02:00,123.456
2012-01-01,03:00,123.456
2012-01-01,04:00,123.456
2012-01-01,05:00,123.456
...
2012-12-30,14:00,123.456
2012-12-30,15:00,123.456
2012-12-30,16:00,123.456
2012-12-30,17:00,123.456
2012-12-30,18:00,123.456
2012-12-30,19:00,123.456
2012-12-30,20:00,123.456
2012-12-30,21:00,123.456
2012-12-30,22:00,123.456
2012-12-30,23:00,123.456

If you will post the top of your file, (so I can get csv headers and
data that lines up with my timestamps)
I'll add the rest.

I am guessing you have filenames based on year, like
observation-2012.csv  - give me the file name and I'll roll that in.


> Thanks,
>
>
> Rich
>



-- 
Carl K


Re: [PLUG] Correcting duplicate strings in files

2018-06-19 Thread Robert Citek
$2 != "16.00" { print ; next }  <= the decimal should be a colon, 16:00 vs 16.00
flag == 1 && $2 == "16:00" { $2=="17:00"; print; flag=0 ; next } <=
equality should be assignment, $2= vs $2==

Here's a refactored version that you can put in a file:

BEGIN {OFS=FS=","} ;
flag == 1 && $2 == "16:00" { $2 = "17:00" ; flag = 0 } ;
$2 == "16:00" { flag = 1 } ;
{ print } ;
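(A line-by-line Python transcription of that flag logic, offered as a sketch for readers who don't speak awk; the sample rows are hypothetical.)

```python
def fix_hours(lines):
    """Relabel the second of two consecutive 16:00 rows as 17:00."""
    out, flag = [], False
    for line in lines:
        fields = line.split(",")
        if flag and fields[1] == "16:00":
            fields[1] = "17:00"   # second 16:00: relabel, clear the flag
            flag = False
        elif fields[1] == "16:00":
            flag = True           # first 16:00: arm the flag
        out.append(",".join(fields))
    return out

print(fix_hours(["2012-10-01,16:00,90.8121",
                 "2012-10-01,16:00,90.8121"]))
# ['2012-10-01,16:00,90.8121', '2012-10-01,17:00,90.8121']
```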


BTW, in your sample data set the 15:00 and 16:00 entries are identical
in the last field.  Is that expected or coincidental?

Regards,
- Robert


On Tue, Jun 19, 2018 at 3:31 PM, Rich Shepard  wrote:
> On Tue, 19 Jun 2018, Robert Citek wrote:
>
>> Couple of typos and an addition (-F,) :
>
>
>   I'm not seeing the typos.
>
>> { cat <<eof
>> 2012-10-01,14:00,90.7999
>> 2012-10-01,15:00,90.8121
>> 2012-10-01,16:00,90.8121
>> 2012-10-01,16:00,90.8121
>> 2012-10-01,18:00,90.8091
>> 2012-10-01,19:00,90.8030
>> eof
>> } | awk -F,  '
>> $2 != "16:00" { print ; next }
>> flag == 0 && $2 == "16:00" { print ; flag=1 ; next }
>> flag == 1 && $2 == "16:00" { $2="17:00"; print; flag=0 ; next }
>> '
>
>
>   I have the code in a file and run it with the '-f' option:
> gawk -f correct-double-hour.awk test.dat > out.dat
>
> correct-double-hour.awk:
>
> #!/usr/bin/gawk
> #
> # This script replaces the second instance of 16:00 with 17:00.
>
> BEGIN { FS=","; OFS="," }
> $2 != "16.00" { print ; next }
> flag == 0 && $2 == "16:00" { print ; flag=1 ; next }
> flag == 1 && $2 == "16:00" { $2=="17:00"; print; flag=0 ; next }
>
> Thanks,
>
> Rich


Re: [PLUG] Correcting duplicate strings in files

2018-06-19 Thread Rich Shepard

On Tue, 19 Jun 2018, Robert Citek wrote:


Couple of typos and an addition (-F,) :


  I'm not seeing the typos.


{ cat <<eof
2012-10-01,14:00,90.7999
2012-10-01,15:00,90.8121
2012-10-01,16:00,90.8121
2012-10-01,16:00,90.8121
2012-10-01,18:00,90.8091
2012-10-01,19:00,90.8030
eof
} | awk -F,  '
$2 != "16:00" { print ; next }
flag == 0 && $2 == "16:00" { print ; flag=1 ; next }
flag == 1 && $2 == "16:00" { $2="17:00"; print; flag=0 ; next }
'

  I have the code in a file and run it with the '-f' option:
gawk -f correct-double-hour.awk test.dat > out.dat

correct-double-hour.awk:

#!/usr/bin/gawk
#
# This script replaces the second instance of 16:00 with 17:00.

BEGIN { FS=","; OFS="," }
$2 != "16.00" { print ; next }
flag == 0 && $2 == "16:00" { print ; flag=1 ; next }
flag == 1 && $2 == "16:00" { $2=="17:00"; print; flag=0 ; next }

Thanks,

Rich


Re: [PLUG] Correcting duplicate strings in files

2018-06-19 Thread Robert Citek
Couple of typos and an addition (-F,) :

{ cat <<eof
2012-10-01,14:00,90.7999
2012-10-01,15:00,90.8121
2012-10-01,16:00,90.8121
2012-10-01,16:00,90.8121
2012-10-01,18:00,90.8091
2012-10-01,19:00,90.8030
eof
} | awk -F,  '
$2 != "16:00" { print ; next }
flag == 0 && $2 == "16:00" { print ; flag=1 ; next }
flag == 1 && $2 == "16:00" { $2="17:00"; print; flag=0 ; next }
'

On Tue, Jun 19, 2018, Rich Shepard  wrote:
> On Tue, 19 Jun 2018, Robert Citek wrote:
>
>> A quick pass.  Needs testing and refactoring.
>>
>> $2 != "16.00" { print ; next }
>> flag == 0 && $2 == "16:00" { print ; flag=1 ; next }
>> flag == 1 && $2 == "16:00" { $2=="17:00"; print; flag=0 ; next }
>
>
>   Thanks, Robert. I tried variations of this using if and regex for the
> patterns, but they didn't work. Here's my test file (cleverly named
> test.dat):
>
> 2012-10-01,14:00,90.7999
> 2012-10-01,15:00,90.8121
> 2012-10-01,16:00,90.8121
> 2012-10-01,16:00,90.8121
> 2012-10-01,18:00,90.8091
> 2012-10-01,19:00,90.8030
>
>   Your script did what mine did, added two more rows with 16:00:
>
> 2012-10-01,14:00,90.7999
> 2012-10-01,15:00,90.8121
> 2012-10-01,16:00,90.8121
> 2012-10-01,16:00,90.8121
> 2012-10-01,16:00,90.8121
> 2012-10-01,16:00,90.8121
> 2012-10-01,18:00,90.8091
> 2012-10-01,19:00,90.8030
>
>   Wrapping the patterns in parentheses and forward slashes makes no
> difference. I'm sure the correct script will appear to be obvious once I
> learn how to do this.
>
> Best regards,
>
>
> Rich
>


Re: [PLUG] Correcting duplicate strings in files

2018-06-19 Thread Robert Citek
A quick pass.  Needs testing and refactoring.

$2 != "16.00" { print ; next }
flag == 0 && $2 == "16:00" { print ; flag=1 ; next }
flag == 1 && $2 == "16:00" { $2=="17:00"; print; flag=0 ; next }


On Tue, Jun 19, 2018 at 2:04 PM, Rich Shepard  wrote:
> On Tue, 19 Jun 2018, Robert Citek wrote:
>
>> Good luck and let us know how things go.
>
>
>   This can be done using awk and flags. I've not before used flags in awk so
> I don't know the proper sequence of commands. What I have now is:
>
> $2!="16.00" { print }
> $2=="16:00" { print; flag=1 }
> $2=="16:00" { $2=="17:00"; print; flag=0 }
>
>   This prints the input file without change. If anyone has thoughts on how
> to use a flag to change the value of field 2 please share them with me.
>
>
> Rich


Re: [PLUG] Correcting duplicate strings in files

2018-06-19 Thread Rich Shepard

On Tue, 19 Jun 2018, Robert Citek wrote:


Good luck and let us know how things go.


  This can be done using awk and flags. I've not before used flags in awk so
I don't know the proper sequence of commands. What I have now is:

$2!="16.00" { print }
$2=="16:00" { print; flag=1 }
$2=="16:00" { $2=="17:00"; print; flag=0 }

  This prints the input file without change. If anyone has thoughts on how
to use a flag to change the value of field 2 please share them with me.

Rich


Re: [PLUG] Correcting duplicate strings in files

2018-06-19 Thread Rich Shepard

On Tue, 19 Jun 2018, Robert Citek wrote:


I don't fully understand your question, but here are some examples
that may be a step in the right direction:


Robert,

  I did not provide as complete an explanation as I should have.

  Each file has 8761 lines, one for each hour of each day during the
(non-leap) year, plus a header line. It's not just two isolated lines,
unfortunately.

  I don't follow the logic of your two examples for finding only the
duplicate 16:00 hours in each day, and changing only the second instance to
17:00.


$ seq 1 5 | sed -e '1~2s/$/ --/'
1 --
2
3 --
4
5 --

$ seq 1 5 | sed -e '0~2s/$/ --/'
1
2 --
3
4 --
5


  Perhaps I need to write a python script that looks for the string, 16:00,
and sets a flag the first time that's found. The next time it's found, in
the following row, the flag is set so the string is changed to 17:00 and the
flag is unset. Then the script keeps reading until it encounters the next
day's 16:00 row.
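(A sketch of that scan in Python, keying the flag to the date field so it resets for each new day; the rows shown are hypothetical stand-ins for the real files.)

```python
def relabel(rows):
    """rows: 'YYYY-MM-DD,HH:MM,value' strings in file order."""
    seen_1600_on = None              # date whose first 16:00 has been seen
    fixed = []
    for row in rows:
        date, hour, value = row.split(",")
        if hour == "16:00":
            if seen_1600_on == date:
                hour = "17:00"       # second 16:00 of this day
            else:
                seen_1600_on = date  # first 16:00: remember the day
        fixed.append(",".join((date, hour, value)))
    return fixed

for line in relabel(["2012-10-01,16:00,90.8121",
                     "2012-10-01,16:00,90.8121"]):
    print(line)
```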

Thanks,

Rich



Re: [PLUG] Correcting duplicate strings in files

2018-06-19 Thread Robert Citek
I don't fully understand your question, but here are some examples
that may be a step in the right direction:

$ seq 1 5 | sed -e '1~2s/$/ --/'
1 --
2
3 --
4
5 --

$ seq 1 5 | sed -e '0~2s/$/ --/'
1
2 --
3
4 --
5

$ echo -e "2012-10-01,16:00,297.94\n2012-10-01,16:00,297.94" | sed -e
'0~2s/16:00/17:00/'
2012-10-01,16:00,297.94
2012-10-01,17:00,297.94

Good luck and let us know how things go.

Regards,
- Robert

On Tue, Jun 19, 2018 at 11:52 AM, Rich Shepard  wrote:
>   I made a mistake when writing an awk script that inserts the time of an
> observation with its value. I had 16:00 twice in a row rather than 16:00 and
> 17:00. This holds for every day in the year, and I have about 12 years in
> which to make the correction. Specifically, changing the second 16:00 to
> 17:00. A sample:
>
> 2012-10-01,16:00,297.94
> 2012-10-01,16:00,297.94
>
>   I'm stuck trying to find a way to make the change using sed, awk, or grep.
> How do I ignore the first instance and change only the second instance?
>
>   If there's a perl script to do this, please share it with me as I'm not a
> perl coder.
>
>   I'm looking forward to learning how to do this job.
>
> Rich