Re: [Trisquel-users] Removing unwanted carriage returns
Previously, I had attempted a join script:

> Step (5) I attempted to join the present spreadsheet with the domains-visited and visits-per-domain data:
> join -a 2 -1 1 -2 1 But the results look incomplete: only 13,000 rows of fully filled-in data with correct & complete counts,
> yet there are 330,000 rows of the uncombined data ... adding up to 343,000 rows. Needs some work ...

You bet it needs some work ... I had made a couple of irreparable errors, so I restarted the construction of the useless spreadsheet, which is now ready to be filled in per a previous posting. More about this later.

George Langford
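One thing worth ruling out when join pairs up only a fraction of the rows: join expects both inputs to be sorted on the join field, in the same collation order, and it can miss matches otherwise. What follows is only a minimal sketch under the assumption that sorting is part of the trouble (it may not be); the file names left.txt and right.txt and their contents are made up for illustration.

```shell
# Hypothetical inputs: left.txt holds (address, country), right.txt holds (address, visits).
printf '%s\n' 'b::2 IN' 'a::1 US' 'c::3 DE' > left.txt
printf '%s\n' 'c::3 7' 'a::1 12' 'd::4 1' > right.txt

# Sort both files on the join field under the same locale before joining.
LC_ALL=C sort -k1,1 left.txt > left.sorted.txt
LC_ALL=C sort -k1,1 right.txt > right.sorted.txt

# -a 2 also prints the rows of the second file that found no partner in the first.
joined=$(LC_ALL=C join -a 2 -1 1 -2 1 left.sorted.txt right.sorted.txt)
printf '%s\n' "$joined"
```

Rows that exist only in the second file come out with their original columns and nothing added, which is one way a result can end up with a mix of filled-in and unfilled rows.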
Re: [Trisquel-users] Removing unwanted carriage returns
Remember the Delta Process from Calculus 1.01 ?
https://www.brighthubeducation.com/homework-math-help/108376-finding-the-derivative-of-a-function-in-calculus/
That's where I am in Scripting 1.01 ...

Back to the problem at hand.

Step (1) selected the IPv6 addresses of the Type A & Type B rows in the cleansed File01.txt:

awk '{print $2}' 'File01.txt' | sort | uniq -c > TempTQ01.txt ; awk '{print $2, $1}' 'TempTQ01.txt' | sort -nrk 2 > TempTQ02.txt

Step (2) selects and lists all the Type B entries in File01.txt (SS.IPv6-HN-GLU-MB-Domains-January2020-All.ods.txt):

awk '{print $1}' 'TempTQ02.txt' | sort > TempTQ10.txt ; awk '{print $1}' 'TempTQ10.txt' - | grep - File01.txt | awk '{print $1,$2,$3,$4}' '-' > TempTQ11.txt

Never mind simplicity or efficiency; it took 0.006 second CPU time and 0.032 second real time. It did reveal a number of Type C rows that I had missed in my visual inspection ==> TempTQ13.txt

Next step: For each row in TempTQ11.txt, print $2,$3,$4 to cache, find $1 in File10.txt's $2 column, and print that $2 to Column $1 along with cache's $2,$3,$4 into a new file ...

Step (3) matches the Keys in Col.$2 of the Type A rows with the data in Col's $2,$3 & $4 of Type B rows:

join -a 2 -1 1 -2 2
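A side note on Step (2): as written, `grep - File01.txt` takes the lone `-` as the pattern to search for, not as "read the patterns from the pipe". If the intent was to keep every row of File01.txt that mentions one of the keys, grep's -f option reads a list of patterns from a file. The sketch below assumes that intent and assumes fixed-string matching (-F) is what is wanted; rows.txt, keys.txt and their contents are invented stand-ins, not the real files.

```shell
# Hypothetical data: rows.txt mimics File01.txt, keys.txt mimics the sorted key list.
printf '%s\n' \
  '2001:db8:1::10 2001:db8:1::99' \
  '2001:db8:1::99 2001:db8:1::/48 IN AS64500' \
  '2001:db8:2::10 2001:db8:2::99' > rows.txt
printf '%s\n' '2001:db8:1::99' > keys.txt

# -F treats each key as a fixed string; -f reads one pattern per line from keys.txt.
selected=$(grep -F -f keys.txt rows.txt)
printf '%s\n' "$selected"
```

Both the key's own row and every row that points at the key are selected, because the key text appears in each of them.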
Re: [Trisquel-users] Removing unwanted carriage returns
These "trivial" AWK programs are presently beyond my ken. Way too compact for me at this hour.

In the meantime I started with this script:

awk '{print $2}' 'File01.txt' | sort | uniq -c > TempTQ01.txt ; awk '{print $2, $1}' 'TempTQ01.txt' | sort -nrk 2 > TempTQ02.txt

where File01.txt is the original 370,000 row file, albeit truncated to exclude the 38,000 rows that I've already filled in and also all the five-column resolved hostname rows, leaving only Type A and Type B IPv6 address data in 347,000 rows.

Type B are the keyed IPv6 addresses and the number of occurrences; Type B are CIDR blocks distinguishable by their '/' characters. We don't need to use the Type B data except as a reality check.

What remains is to print all the $1 columns of the IPv6 rows that match the first IPv6 key in the TempTQ02 list, plus the $2, $3, and $4 columns of the corresponding Type B row to make C rows (Column $2 of TempTQ02.txt) of filled-in data, then move on to the next IPv6 key in the TempTQ02.txt file.

The largest number of occurrences (55,000) exist in one contiguous group of 55,000 rows, one of which contains the IPv6 key address and its three columns of asn-query data. The occurrences data (C) are also needed only as a reality check.

I also meddled with the asn-query source code (https://svn.nmap.org/nmap/scripts/asn-query.nse) and learned how to store & retrieve it as a program file which returns the same data for those eight IPv6 addresses given above, plus the asn-query data. Alas, further meddling (beyond just putting something else between the quotes of "See the result for %s") has been unproductive.

George Langford
Re: [Trisquel-users] Removing unwanted carriage returns
I'll restate the problem, unencumbered by distracting arrays of colons and hexadecimals.

All 387,000 rows fall into one of three types, each IP address appearing only once in the first column:

Type A: $1 ("key" IP address), $2 (CIDR block), $3 (country code), $4 (AS number)
Type B: $1 (IP address falling within the $2 CIDR block of Type A), $2 (Type A's "key" IP address, repeated many times in successive rows)
Type C: $1 (hostname), $2 (IP address from which the $1 hostname can be resolved), $3 (CIDR block), $4 (country code), $5 (AS number)
(Type C is not very populous and can be handled with Leafpad)

The desired script: awk should locate Type A's $1Key and find all the Type B rows whose $2Key matches Type A's $1Key, and then copy Type A's columns $2, $3 & $4 in place of Type B's column $2 in every instance of a match with Type A's $1Key.

I have found a small number of Type A rows with no data, but those I can look up with whois and fix easily.

The already looked-up hostnames are the only non-IP data in the $1 columns of Types A & B, so awk can safely concentrate on all the Columns $1. Also, all the IP addresses of looked-up hostnames will not reappear as not-looked-up IP addresses.

If awk can do everything described above with the first Type A $1Key before proceeding, even if that involves searching the entire 370,000 rows once for each Type A $1Key, then we're on the right track.

George Langford
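The fill-in described above can be sketched in awk without searching the file once per key: read the file twice, remembering every Type A row on the first pass and rewriting the Type B rows on the second. This is only a sketch of one possible approach, not a tested solution for the full file. It assumes whitespace-separated columns, that Type A rows have exactly four columns and Type B rows exactly two, and the name sample.txt is made up; the four sample rows are copied from the spreadsheet snippet posted elsewhere in this thread.

```shell
# Sample rows copied from the spreadsheet snippet posted elsewhere in this thread.
cat > sample.txt <<'EOF'
2401:4900:1888:c07f:1:2:4283:5767 2401:4900:1888:fcb4:1:2:4282:aab3
2401:4900:1888:fcb4:1:2:4282:aab3 2401:4900:1888::/48 IN AS45609
2401:4900:1889:9396:5693:8b98:3a70:da67 2401:4900:1889:ec73:92c5:3a0a:76d5:10c0
2401:4900:1889:ec73:92c5:3a0a:76d5:10c0 2401:4900:1889::/48 IN AS45609
EOF

# The file is named twice so that awk reads it twice.
# Pass 1 (NR == FNR): remember columns 2-4 of every four-column (Type A) row, keyed by column 1.
# Pass 2: for two-column (Type B) rows whose column 2 is a known key, print the remembered data instead.
# Every other row (Type A, Type C, unmatched Type B) is printed unchanged.
filled=$(awk '
  NR == FNR { if (NF == 4) data[$1] = $2 " " $3 " " $4; next }
  NF == 2 && ($2 in data) { print $1, data[$2]; next }
  { print }
' sample.txt sample.txt)
printf '%s\n' "$filled"
```

Because the keys are collected before any row is rewritten, it does not matter whether a key row sits before or after the rows that refer to it. Type B rows whose key has no Type A data are left as they are, so they remain easy to find afterwards.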
Re: [Trisquel-users] Removing unwanted carriage returns
See: https://svn.nmap.org/nmap/scripts/asn-query.nse where the applicable (?) script reads, noting especially "( "See the result for %s" ):format( last_ip )":

-- ... begin snip ---
-- Checks whether the target IP address is within any BGP prefixes for which a query has
-- already been performed and returns a pointer to the HOST SCRIPT RESULT displaying the applicable answers.
-- @param ip String representing the target IP address.
-- @return Boolean true if there are cached answers for the supplied target, otherwise
-- false.
-- @return Table containing a string for each answer or nil if there are none.
function check_cache( ip )
  local ret = {}
  -- collect any applicable answers
  for _, cache_entry in ipairs( nmap.registry.asn.cache ) do
    if ipOps.ip_in_range( ip, cache_entry.cache_bgp ) then
      ret[#ret+1] = cache_entry
    end
  end
  if #ret < 1 then return false, nil end
  -- /0 signals that we want to kill this thread (all threads in fact)
  if #ret == 1 and type( ret[1].cache_bgp ) == "string" and ret[1].cache_bgp:match( "/0" ) then return true, nil end
  -- should return pointer unless there are more than one unique pointer
  local dirty, last_ip = false
  for _, entry in ipairs( ret ) do
    if last_ip and last_ip ~= entry.pointer then
      dirty = true; break
    end
    last_ip = entry.pointer
  end
  if not dirty then
    return true, ( "See the result for %s" ):format( last_ip )
  else
    return true, ret
  end
  return false, nil
end
... end snip --

Where we should _print_ the result for %s instead of just pointing to it ...

George Langford
Re: [Trisquel-users] Removing unwanted carriage returns
Here's my present dilemma, exemplified by a snippet from the spreadsheet:

2401:4900:1888:c07f:1:2:4283:5767 2401:4900:1888:fcb4:1:2:4282:aab3
2401:4900:1888:cd70:1:1:4a58:fc0c 2401:4900:1888:fcb4:1:2:4282:aab3
2401:4900:1888:d068:fce8:8739:a7a0:4c60 2401:4900:1888:fcb4:1:2:4282:aab3
2401:4900:1888:e8f5:1:2:4cde:e7ca 2401:4900:1888:fcb4:1:2:4282:aab3
2401:4900:1888:ee55:23c5:e0ec:79fb:59dd 2401:4900:1888:fcb4:1:2:4282:aab3
2401:4900:1888:fcb4:1:2:4282:aab3 2401:4900:1888::/48 IN AS45609
2401:4900:1889:9396:5693:8b98:3a70:da67 2401:4900:1889:ec73:92c5:3a0a:76d5:10c0
2401:4900:1889:a2d9:382e:b73:73dd:8693 2401:4900:1889:ec73:92c5:3a0a:76d5:10c0
2401:4900:1889:aa8c:730c:fa94:8c27:7bf9 2401:4900:1889:ec73:92c5:3a0a:76d5:10c0
2401:4900:1889:aad7:1:1:7b54:1e4c 2401:4900:1889:ec73:92c5:3a0a:76d5:10c0
2401:4900:1889:c648:2161:968a:1c9e:b1c1 2401:4900:1889:ec73:92c5:3a0a:76d5:10c0
2401:4900:1889:c7c0:f461:a726:a208:3ccb 2401:4900:1889:ec73:92c5:3a0a:76d5:10c0
2401:4900:1889:cd44:e950:74db:8fd2:c134 2401:4900:1889:ec73:92c5:3a0a:76d5:10c0
2401:4900:1889:ec73:92c5:3a0a:76d5:10c0 2401:4900:1889::/48 IN AS45609

The positions (i.e. $2) ending in ...:aab3 have to be replaced with 2401:4900:1888::/48 IN AS45609 and the positions ending in ...:10c0 ($2) have to be replaced with 2401:4900:1889::/48 IN AS45609 (i.e., $2,$3,$4)

Those key rows, returned by nmap but not repeated by nmap, could have been anywhere in the preceding rows. Of course nmap should not have to repeat the look-ups, but merely repeating the stating of them would be helpful. It is open source ...

The entire text file has 387,000 rows, so even an inefficient script would be plenty fast enough. I can fill in about five thousand rows an hour ... leading to a time estimate of 387,000/5,000*1 hour = 77 hours ... not impossible while I'm housebound.

It may look silly when the spreadsheet is sorted by IPv6 address, but it's all very necessary when it's sorted by the number of domains visited and/or the number of visits per domain.

George Langford
Re: [Trisquel-users] Removing unwanted carriage returns
janet admonished me:

> Did you even bother to read the regular expression I provided to use in vim?

I stopped reading after the word, "windows."
Re: [Trisquel-users] Removing unwanted carriage returns
Magic Banana constructively added:

> It looks like you could have nmap format its output ...

Oh ! Gee ! That's a welcome suggestion. I have two more sets of IPv6 data already nmap'ed over quite a few hours that are in the old grep-unfriendly format. Fortunately, my brute-force workarounds are less time-consuming than the original nmap scans, from which there is no escape.

Unfortunately, nmap's avoidance of excessive repetition runs afoul of my use of the simple-to-use LibreOffice Calc in that I'm faced with multiple days of filling in the empty cells between the infrequent asn-query results, which nmap limits to one lookup per CIDR block.

Another roadblock is Google's aversion to robots, so my search for "other" IP addresses of multi-addressed PTR's is necessarily a manual task, what with the scores of CIDR blocks filled with identically named PTR's. Try chasing down hn.kd.ny.adsl, motorspb.fvds.ru or hosted-by.leaseweb.com.

George Langford
Re: [Trisquel-users] Removing unwanted carriage returns
amenex wrote:

>Not just _any_ new line character: A combination of the new line character
>on the end of one row, plus the phrase at the beginning of the following row.
>Removing the new line characters willy-nilly will leave a one-row file with
>all 750,000 lines all concatenated together ... I've done that inadvertently.

Did you even bother to read the regular expression I provided to use in vim?

%s/\nSee the result for/\tSee/g

It means: find the new line followed by "See the result for" and replace it with a tab character followed by the word "See" - exactly what you've been asking about.
Re: [Trisquel-users] Removing unwanted carriage returns
jaret remarked:

> I believe you are referencing to a new line character, ...

Not just _any_ new line character: A combination of the new line character on the end of one row, plus the phrase at the beginning of the following row. Removing the new line characters willy-nilly will leave a one-row file with all 750,000 lines all concatenated together ... I've done that inadvertently.

What I did do was to divide those 750,000 rows into twenty 50,000 row files and then apply search & replace in Leafpad, which took a couple of minutes for each file. It took longer to subdivide the original file by hand ...

George Langford
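For the subdividing step, the coreutils split command can cut a file into fixed-size pieces without any hand work. A minimal sketch follows, with a ten-row stand-in file and four-row pieces so the result is easy to inspect; the names bigfile.txt and part_ are made up, and for the case described above the line count would be 50000.

```shell
# Hypothetical stand-in for the big file: ten numbered rows.
seq 1 10 > bigfile.txt

# Cut bigfile.txt into pieces of at most 4 lines each, named part_aa, part_ab, part_ac.
split -l 4 bigfile.txt part_

# List the pieces with their line counts.
wc -l part_*
```

Concatenating the pieces in name order (cat part_* > rejoined.txt) gives back the original file, so the edited pieces can be reassembled the same way.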
Re: [Trisquel-users] Removing unwanted carriage returns
Install vim.
Open text file in vim.
Press colon on keyboard ( : )
Type %s/\nSee the result for/\tSee/g
Press Enter.
To save file and quit press colon : then wq then Enter. ( :wq )

I believe you are referencing to a new line character, when you are saying "carriage return". \n is the newline character in regular expressions and is the standard Unix end-of-line character. In Windows/DOS text it is a carriage return followed by a new line, \r\n. In that situation the regexp will be

%s/\r\nSee the result for/\tSee/g
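For anyone who would rather stay on the command line, roughly the same substitution can be sketched in awk: whenever a line starts with "See the result for", attach it to the previous line with a tab and shorten the phrase to "See". This is an illustrative equivalent under the assumption of Unix line endings, not a replacement for the vim command; the file name scan.txt and its contents are made up.

```shell
# Hypothetical input: result rows, two of them followed by a "See the result for" line.
printf '%s\n' \
  'row one' \
  'See the result for 2001:db8::1' \
  'row two' \
  'row three' \
  'See the result for 2001:db8::2' > scan.txt

joined=$(awk '
  # A continuation line: shorten the phrase to "See" and append it after a tab.
  NR > 1 && /^See the result for/ { sub(/^See the result for/, "See"); printf "\t%s", $0; next }
  # Any other line ends the previous output line and starts a new one.
  NR > 1 { print "" }
  { printf "%s", $0 }
  END { if (NR) print "" }
' scan.txt)
printf '%s\n' "$joined"
```

Only the newline in front of the phrase is removed, so all other rows keep their own lines.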