Re: [Trisquel-users] Removing unwanted carriage returns

2020-03-25 Thread amenex

Previously, I had attempted a join script:

> Step (5) I attempted to join the present spreadsheet with the
> domains-visited and visits-per-domain data:
>
> join -a 2 -1 1 -2 1
>
> But the results look incomplete: only 13,000 rows of fully filled-in
> data with correct & complete counts, yet there are 330,000 rows of the
> uncombined data ... adding up to 343,000 rows. Needs some work ...


You bet it needs some work ... I had made a couple of irreparable errors, so
I restarted the construction of the useless spreadsheet, which is now ready
to be filled in per a previous posting. More about this later.


George Langford


Re: [Trisquel-users] Removing unwanted carriage returns

2020-03-23 Thread amenex

Remember the Delta Process from Calculus 1.01 ?
https://www.brighthubeducation.com/homework-math-help/108376-finding-the-derivative-of-a-function-in-calculus/

That's where I am in Scripting 1.01 ...

Back to the problem at hand.

Step (1) selects the IPv6 addresses of the Type A & Type B rows in the
cleansed File01.txt:

awk '{print $2}' 'File01.txt' | sort | uniq -c > TempTQ01.txt ;
awk '{print $2, $1}' 'TempTQ01.txt' | sort -nrk 2 > TempTQ02.txt
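
To show the shape of what Step (1) produces, here is a minimal sketch that runs the same two commands on a few made-up rows. The file names sample.txt, counts.txt and ranked.txt and the shortened keys k1, k2, k3 are hypothetical stand-ins for File01.txt, the TempTQ files and real IPv6 addresses.

```shell
# Hypothetical sample: column 1 = address, column 2 = its "key" address.
workdir=$(mktemp -d)
printf '%s\n' 'a1 k1' 'a2 k1' 'a3 k2' 'a4 k1' 'a5 k2' 'a6 k3' > "$workdir/sample.txt"

# Count how often each key occurs in column 2 (uniq -c prints "count key") ...
awk '{print $2}' "$workdir/sample.txt" | sort | uniq -c > "$workdir/counts.txt"

# ... then swap the columns to "key count" and list the most frequent keys first.
awk '{print $2, $1}' "$workdir/counts.txt" | sort -nrk 2 > "$workdir/ranked.txt"

cat "$workdir/ranked.txt"
```

Each output row is a key followed by the number of rows that refer to it, with the most frequently referenced key on top.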

Step (2) selects and lists all the Type B entries in File01.txt
(SS.IPv6-HN-GLU-MB-Domains-January2020-All.ods.txt):

awk '{print $1}' 'TempTQ02.txt' | sort > TempTQ10.txt ;
awk '{print $1}' 'TempTQ10.txt' - | grep - File01.txt | awk '{print $1,$2,$3,$4}' '-' > TempTQ11.txt


Never mind simplicity or efficiency; it took 0.006 second CPU time and 0.032
second real time. It did reveal a number of Type C rows that I had missed in
my visual inspection ==> TempTQ13.txt
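
If the intent of Step (2) is to keep only those rows whose first column is one of the listed keys, a two-file awk pass is one way to sketch it. This is an assumption about the goal, not a description of what the commands above do; keys.txt, rows.txt, selected.txt and the shortened addresses are hypothetical.

```shell
workdir=$(mktemp -d)
# Hypothetical key list (one key per line), standing in for TempTQ10.txt.
printf '%s\n' 'k1' 'k2' > "$workdir/keys.txt"
# Hypothetical data rows, standing in for File01.txt.
printf '%s\n' 'a1 k1' 'k1 2001:db8:1::/48 IN AS64500' \
              'a2 k2' 'k2 2001:db8:2::/48 IN AS64501' > "$workdir/rows.txt"

# While reading the first file (NR == FNR), remember every key.
# While reading the second file, print the first four columns of each
# row whose column 1 is one of the remembered keys.
awk 'NR == FNR { keys[$1]; next }
     $1 in keys { print $1, $2, $3, $4 }' \
    "$workdir/keys.txt" "$workdir/rows.txt" > "$workdir/selected.txt"

cat "$workdir/selected.txt"
```

Because the match is an exact comparison on column 1, a key that merely appears somewhere else in a row does not select that row.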


Next step: For each row in TempTQ11.txt, print $2,$3,$4 to cache, find $1 in
File10.txt's $2 column, and print that $2 to Column $1 along with cache's
$2,$3,$4 into a new file ...


Step (3) matches the Keys in Col.$2 of the Type A rows with the data in Col's
$2,$3 & $4 of Type B rows:

join -a 2 -1 1 -2 2
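
For reference, join expects both inputs to be sorted on their join fields, and using the same collation for sort and join avoids surprises. The sketch below runs the same options on made-up data; keyrows.txt, members.txt and the shortened addresses are hypothetical, since the file arguments of the command above are not shown.

```shell
export LC_ALL=C   # use one collation order for both sort and join
workdir=$(mktemp -d)

# Hypothetical rows carrying the data: key, CIDR block, country code, AS number.
printf '%s\n' 'k2 2001:db8:2::/48 IN AS64501' \
              'k1 2001:db8:1::/48 IN AS64500' > "$workdir/keyrows.txt"
# Hypothetical rows that only point at a key: address, key.
printf '%s\n' 'a3 k2' 'a1 k1' 'a4 k9' 'a2 k1' > "$workdir/members.txt"

# Sort each file on the field that join will compare.
sort -k 1,1 "$workdir/keyrows.txt" > "$workdir/keyrows.sorted"
sort -k 2,2 "$workdir/members.txt" > "$workdir/members.sorted"

# Join field 1 of the first file with field 2 of the second;
# -a 2 also keeps rows of the second file that found no partner.
join -a 2 -1 1 -2 2 "$workdir/keyrows.sorted" "$workdir/members.sorted" \
    > "$workdir/joined.txt"

cat "$workdir/joined.txt"
```

Each output row starts with the key, followed by the data columns and then the member address. The row for k9, which has no data row, still appears because of -a 2, but without CIDR data.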


Re: [Trisquel-users] Removing unwanted carriage returns

2020-03-22 Thread amenex
These "trivial" AWK programs are presently beyond my ken. Way too compact for
me at this hour.


In the meantime I started with this script:

awk '{print $2}' 'File01.txt' | sort | uniq -c > TempTQ01.txt ;
awk '{print $2, $1}' 'TempTQ01.txt' | sort -nrk 2 > TempTQ02.txt

where File01.txt is the original 370,000 row file, albeit truncated to exclude
the 38,000 rows that I've already filled in and also all the five-column
resolved hostname rows, leaving only Type A and Type B IPv6 address data in
347,000 rows. Type B are the keyed IPv6 addresses and the number of
occurrences; Type B are CIDR blocks distinguishable by their '/' characters.
We don't need to use the Type B data except as a reality check.

What remains is to print all the $1 columns of the IPv6 rows that match the
first IPv6 key in the TempTQ02 list, plus the $2, $3, and $4 columns of the
corresponding Type B row to make C rows (Column $2 of TempTQ02.txt) of
filled-in data, then move on to the next IPv6 key in the TempTQ02.txt file.
The largest number of occurrences (55,000) exist in one contiguous group of
55,000 rows, one of which contains the IPv6 key address and its three columns
of asn-query data. The occurrences data (C) are also needed only as a reality
check.

I also meddled with the asn-query source code
(https://svn.nmap.org/nmap/scripts/asn-query.nse) and learned how to store &
retrieve it as a program file which returns the same data for those eight IPv6
addresses given above, plus the asn-query data. Alas, further meddling (beyond
just putting something else between the quotes of "See the result for %s") has
been unproductive.


George Langford


Re: [Trisquel-users] Removing unwanted carriage returns

2020-03-22 Thread amenex
I'll restate the problem, unencumbered by distracting arrays of colons and
hexadecimals.

All 387,000 rows fall into one of three types, each IP address appearing only
once in the first column:

Type A: $1("key" IP address), $2(CIDR block), $3(country code), $4(AS number)

Type B: $1(IP address falling within the $2CIDR block of Type A), $2(Type A's
"key" IP address, repeated many times in successive rows)

Type C: $1(hostname), $2(IP address from which $1hostname can be resolved),
$3(CIDR block), $4(country code), $5(AS number)


(Type C is not very populous and can be handled with Leafpad)

The desired script:
  awk should locate Type A's $1Key and find all the Type B rows whose $2Key
  match $1's Key, and then copy Type A's columns $2, $3 & $4 in place of
  Type B's column $2 in every instance of a match with Type A's $1Key

I have found a small number of Type A rows with no data, but those I can look
up with whois and fix easily.

The already looked-up hostnames are the only non-IP data in the $1 columns of
Types A & B, so awk can safely concentrate on all the Columns $1.

Also, all the IP addresses of looked-up hostnames will not reappear as
not-looked-up IP addresses.


If awk can do everything described above with the first Type A $1Key before
proceeding, even if that involves searching the entire 370,000 rows once for
each Type A $1Key, then we're on the right track.
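
One way to sketch the desired script is to read the file twice with a single awk program: the first pass remembers the data of every Type A row, and the second pass fills it into the Type B rows. The sketch assumes that Type A rows can be recognised by the "/" in column 2; rows.txt, filled.txt and the shortened addresses are hypothetical.

```shell
workdir=$(mktemp -d)
# Hypothetical sample: Type B rows (address, key) and Type A rows
# (key, CIDR block, country code, AS number), in arbitrary order.
printf '%s\n' \
  'a1 k1' \
  'a2 k1' \
  'k1 2001:db8:1::/48 IN AS64500' \
  'a3 k2' \
  'k2 2001:db8:2::/48 IN AS64501' > "$workdir/rows.txt"

# Pass 1 (NR == FNR): store "CIDR country AS" under each Type A key.
# Pass 2: where column 2 is a stored key, print column 1 plus the stored
# data; every other row is printed unchanged.
awk 'NR == FNR { if ($2 ~ /\//) data[$1] = $2 " " $3 " " $4; next }
     ($2 in data) { print $1, data[$2]; next }
     { print }' "$workdir/rows.txt" "$workdir/rows.txt" > "$workdir/filled.txt"

cat "$workdir/filled.txt"
```

Naming the same file twice on the command line means the key rows may sit anywhere relative to the rows that refer to them, and each row costs one array lookup rather than a search of the whole file per key.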


George Langford


Re: [Trisquel-users] Removing unwanted carriage returns

2020-03-22 Thread amenex

See: https://svn.nmap.org/nmap/scripts/asn-query.nse
where the applicable (?) script reads, noting especially
"( "See the result for %s" ):format( last_ip )":


--
... begin snip

---
-- Checks whether the target IP address is within any BGP prefixes for which a query has
-- already been performed and returns a pointer to the HOST SCRIPT RESULT displaying the
-- applicable answers.
-- @param ip String representing the target IP address.
-- @return   Boolean true if there are cached answers for the supplied target, otherwise
--           false.
-- @return   Table containing a string for each answer or nil if there are none.

function check_cache( ip )
  local ret = {}

  -- collect any applicable answers
  for _, cache_entry in ipairs( nmap.registry.asn.cache ) do
    if ipOps.ip_in_range( ip, cache_entry.cache_bgp ) then
      ret[#ret+1] = cache_entry
    end
  end
  if #ret < 1 then return false, nil end

  -- /0 signals that we want to kill this thread (all threads in fact)
  if #ret == 1 and type( ret[1].cache_bgp ) == "string" and ret[1].cache_bgp:match( "/0" ) then return true, nil end

  -- should return pointer unless there are more than one unique pointer
  local dirty, last_ip = false
  for _, entry in ipairs( ret ) do
    if last_ip and last_ip ~= entry.pointer then
      dirty = true; break
    end
    last_ip = entry.pointer
  end
  if not dirty then
    return true, ( "See the result for %s" ):format( last_ip )
  else
    return true, ret
  end

  return false, nil
end

... end snip

--

Where we should _print_ the result for %s instead of just pointing to it ...

George Langford


Re: [Trisquel-users] Removing unwanted carriage returns

2020-03-22 Thread amenex

Here's my present dilemma, exemplified by a snippet from the spreadsheet:

2401:4900:1888:c07f:1:2:4283:5767        2401:4900:1888:fcb4:1:2:4282:aab3
2401:4900:1888:cd70:1:1:4a58:fc0c        2401:4900:1888:fcb4:1:2:4282:aab3
2401:4900:1888:d068:fce8:8739:a7a0:4c60  2401:4900:1888:fcb4:1:2:4282:aab3
2401:4900:1888:e8f5:1:2:4cde:e7ca        2401:4900:1888:fcb4:1:2:4282:aab3
2401:4900:1888:ee55:23c5:e0ec:79fb:59dd  2401:4900:1888:fcb4:1:2:4282:aab3
2401:4900:1888:fcb4:1:2:4282:aab3        2401:4900:1888::/48 IN  AS45609
2401:4900:1889:9396:5693:8b98:3a70:da67  2401:4900:1889:ec73:92c5:3a0a:76d5:10c0
2401:4900:1889:a2d9:382e:b73:73dd:8693   2401:4900:1889:ec73:92c5:3a0a:76d5:10c0
2401:4900:1889:aa8c:730c:fa94:8c27:7bf9  2401:4900:1889:ec73:92c5:3a0a:76d5:10c0
2401:4900:1889:aad7:1:1:7b54:1e4c        2401:4900:1889:ec73:92c5:3a0a:76d5:10c0
2401:4900:1889:c648:2161:968a:1c9e:b1c1  2401:4900:1889:ec73:92c5:3a0a:76d5:10c0
2401:4900:1889:c7c0:f461:a726:a208:3ccb  2401:4900:1889:ec73:92c5:3a0a:76d5:10c0
2401:4900:1889:cd44:e950:74db:8fd2:c134  2401:4900:1889:ec73:92c5:3a0a:76d5:10c0
2401:4900:1889:ec73:92c5:3a0a:76d5:10c0  2401:4900:1889::/48 IN  AS45609

The positions (i.e. $2) ending in ...:aab3 have to be replaced with
2401:4900:1888::/48 IN AS45609
and the positions ending in ...:10c0 ($2) have to be replaced with
2401:4900:1889::/48 IN AS45609 (i.e., $2,$3,$4)

Those key rows, returned by nmap but not repeated by nmap, could have been
anywhere in the preceding rows. Of course nmap should not have to repeat the
look-ups, but merely repeating the stating of them would be helpful. It is
open source ...

The entire text file has 387,000 rows, so even an inefficient script would be
plenty fast enough. I can fill in about five thousand rows an hour ... leading
to a time estimate of 387,000/5,000*1 hour = 77 hours ... not impossible while
I'm housebound.

It may look silly when the spreadsheet is sorted by IPv6 address, but it's all
very necessary when it's sorted by the number of domains visited and/or the
number of visits per domain.

George Langford


Re: [Trisquel-users] Removing unwanted carriage returns

2020-03-22 Thread amenex

janet admonished me:
> Did you even bother to read the regular expression I provided to use in vim?

I stopped reading after the word, "windows."


Re: [Trisquel-users] Removing unwanted carriage returns

2020-03-22 Thread amenex

Magic Banana constructively added:
> It looks like you could have nmap format its output ...

Oh ! Gee ! That's a welcome suggestion. I have two more sets of IPv6 data
already nmap'ed over quite a few hours that are in the old grep-unfriendly
format. Fortunately, my brute-force workarounds are less time-consuming
than the original nmap scans, from which there is no escape.

Unfortunately, nmap's avoidance of excessive repetition runs afoul of my
use of the simple-to-use LibreOffice Calc in that I'm faced with multiple
days of filling in the empty cells between the infrequent asn-query results,
which nmap limits to one lookup per CIDR block.

Another roadblock is Google's aversion to robots, so my search for "other"
IP addresses of multi-addressed PTR's is necessarily a manual task, what
with the scores of CIDR blocks filled with identically named PTR's.

Try chasing down hn.kd.ny.adsl, motorspb.fvds.ru or hosted-by.leaseweb.com.

George Langford



Re: [Trisquel-users] Removing unwanted carriage returns

2020-03-22 Thread vas1980i

amenex wrote:
>Not just _any_ new line character: A combination of the new line character
>on the end of one row, plus the phrase at the beginning of the following row.
>
>Removing the new line characters willy-nilly will leave a one-row file with
>all 750,000 lines all concatenated together ... I've done that inadvertently.


Did you even bother to read the regular expression I provided to use in vim?

%s/\nSee the result for/\tSee/g
It means: find the new line followed by "See the result for" and replace it
with a tab character followed by the word "See" - exactly what you've been
asking about.
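
The same substitution can also be sketched outside vim with awk, for anyone who prefers a scriptable, non-interactive route. This is an alternative to the vim command above, not part of it; in.txt, out.txt and the sample rows are hypothetical.

```shell
workdir=$(mktemp -d)
# Hypothetical sample: the third line belongs on the end of the second.
printf '%s\n' 'addr1 data' 'addr2' 'See the result for addr1' 'addr3 data' \
    > "$workdir/in.txt"

# Hold each line back by one. A line starting with "See the result for"
# is shortened to start with "See" and appended, after a tab, to the held
# line instead of being printed as a row of its own.
awk '/^See the result for/ { sub(/^See the result for/, "See"); held = held "\t" $0; next }
     NR > 1 { print held }
     { held = $0 }
     END { if (NR > 0) print held }' "$workdir/in.txt" > "$workdir/out.txt"

cat "$workdir/out.txt"
```

Only the new line directly in front of the phrase is removed, so all other rows stay on their own lines.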




Re: [Trisquel-users] Removing unwanted carriage returns

2020-03-21 Thread amenex

jaret remarked:
> I believe you are referencing to a new line character, ...

Not just _any_ new line character: A combination of the new line character
on the end of one row, plus the phrase at the beginning of the following row.

Removing the new line characters willy-nilly will leave a one-row file with
all 750,000 lines all concatenated together ... I've done that inadvertently.

What I did do was to divide those 750,000 rows into twenty 50,000 row files
and then apply search & replace in Leafpad, which took a couple of minutes
for each file. It took longer to subdivide the original file by hand ...

George Langford


Re: [Trisquel-users] Removing unwanted carriage returns

2020-03-21 Thread vas1980i

Install vim.
Open text file in vim.
Press the colon key ( : )
Type %s/\nSee the result for/\tSee/g
Press Enter.
To save the file and quit, press colon : then wq then Enter. ( :wq )

I believe you are referencing to a new line character, when you are saying
"carriage return".
\n is the newline character in regular expressions and is the standard Unix
end-of-line character.

In Windows/DOS text files the line ending is a carriage return followed by a
new line: \r\n.
In that situation the regexp will be %s/\r\nSee the result for/\tSee/g
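
If the file does turn out to have Windows/DOS line endings, another option is to strip the carriage returns first, so that the plain \n pattern applies. A minimal sketch, with hypothetical file names dos.txt and unix.txt:

```shell
workdir=$(mktemp -d)
# Hypothetical two-line file with Windows/DOS (\r\n) line endings.
printf 'row one\r\nSee the result for x\r\n' > "$workdir/dos.txt"

# Delete every carriage return; the new line characters are left alone.
tr -d '\r' < "$workdir/dos.txt" > "$workdir/unix.txt"

# Count the carriage returns left in the converted file.
remaining=$(tr -cd '\r' < "$workdir/unix.txt" | wc -c)
echo "carriage returns remaining: $remaining"
```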