On 06/12/2010 07:33 PM, fuzzylogic25 wrote:
> I am doing this through Cygwin, running vim.
>
> It is, after all, a document that contains 2,560,213 lines.

For such a large source, you might consider doing this in sed instead of vim. For your particular use case (extracting URLs), you might try the following:

1) save this as "http.sed"

-------------------8<----------------

# put all URL-ish things on their own line
# by putting a \n before/after the url
s<[hH][tT][tT][pP][sS]\?:\/\/[-a-zA-Z0-9~_!/=+?%:.,!*()@]\+<\n&\n<g

# if we didn't have a url in the string, continue
# by branching to the "e"nd
Te

# otherwise...
:a
 # check if the current buffer starts with a url
 s<^[hH][tT][tT][pP][sS]\?:\/\/[-a-zA-Z0-9~_!/=+?%:.,!*()@]\+\n<&<
 # if it starts with a url, go to "s"uccess
 ts
 # otherwise nuke the stuff before a newline
 s/^[^\n]*\n//
 # if that succeeded, there's more to process
 # so restart at "a"
 ta
 # otherwise, we're done with this line
 # of input so branch to the "e"nd
 be
:s
 # success, so print it (up to the 1st newline)
 P
 # delete up to the first newline
 s/^[^\n]*\n//
 # "branch", but use the "t"/"T" to clear
 # the test-success flag
 ta
 Ta
:e

-------------------8<----------------

2) invoke sed on your file with

  sed -nf http.sed < yourfile.txt > urls_out.txt
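
As a sanity check on a small sample, GNU grep's -o option with the same loose pattern should list the same URLs, one per line. This is a different tool from the sed script above, not a replacement for it, and the sample lines and temp file below are made up purely for illustration:

```shell
#!/bin/sh
# Hypothetical sample input; substitute a slice of your own file.
sample=$(mktemp)
printf '%s\n' \
  'see http://example.com/a and HTTPS://Example.org/b?x=1 here' \
  'no links on this line' \
  'trailing http://example.net/c' > "$sample"

# -o prints only the matched text, one match per output line.
# The bracket expression is the same loose URL class used in http.sed.
# (\? and \+ in a basic regex are GNU extensions.)
urls=$(grep -o '[hH][tT][tT][pP][sS]\?://[-a-zA-Z0-9~_!/=+?%:.,*()@]\+' "$sample")

printf '%s\n' "$urls"
rm -f "$sample"
```

If the sed output and the grep output for the same sample differ, the regexp in one of them has drifted from the other.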

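You might have to tweak the regexp for a URL; the above is a fairly loose approximation of a URL string beginning with http/https (case-insensitive). In particular, the character class omits "&" and "#", so a URL with a query string gets cut off at the first "&".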
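
To see how the character class decides where a URL ends, here is a small sketch that runs only the first substitution from the script (the one that puts each URL on its own line) with two different classes. It assumes GNU sed; the function name split_urls, the sample line, and the wider class with "&" and "#" added are illustrative, not part of the original script:

```shell
#!/bin/sh
# split_urls CLASS: wrap every URL-ish match in newlines, as the first
# "s" command in http.sed does. CLASS is the body of the bracket
# expression listing the characters allowed after "http://".
split_urls() {
  sed "s<[hH][tT][tT][pP][sS]\\?://[$1]\\+<\\n&\\n<g"
}

# Made-up input line with a query string containing "&".
line='x http://a.b/?p=1&q=2 y'

# Same class as the script: the match stops at "&".
loose=$(printf '%s\n' "$line" | split_urls '-a-zA-Z0-9~_!/=+?%:.,*()@')

# Class widened with "&" and "#": the whole query string is kept.
wider=$(printf '%s\n' "$line" | split_urls '-a-zA-Z0-9~_!/=+?%:.,*()@&#')

printf 'loose:\n%s\n\nwider:\n%s\n' "$loose" "$wider"
```

If you widen the class, change it in both "s" commands in http.sed, since the second one has to recognise exactly what the first one split out.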


I commented it pretty liberally with the "#" lines so you can see what's going on. The only gotcha is the "ta"/"Ta" pair. If you just use "ba" to jump unconditionally to the ":a" label, the successful "delete up to the first newline" leaves the substitution-success flag set, regardless of whether the next pass starts with a URL, so the next time through the loop the "ts" succeeds even when it shouldn't. A "t" branch that is taken clears that flag, and the "Ta" after it guarantees the jump happens either way. Note that "T" is a GNU sed extension, which is what Cygwin ships.

Sed ain't pretty, but for your use case, it might save you oodles of time & CPU.
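
The flag behaviour is easier to see in isolation. The following is a minimal sketch, assuming GNU sed; the labels "next" and "hit" and the "abc" input are made up for the demonstration and have nothing to do with http.sed:

```shell
#!/bin/sh
# Variant 1: jump with an unconditional "b".
# s/a/A/ succeeds and sets the flag; "b next" does not clear it, so
# "t hit" fires even though s/zzz// matched nothing.
with_b=$(printf 'abc\n' | sed -n \
  -e 's/a/A/' \
  -e 'b next' \
  -e ':next' \
  -e 's/zzz//' \
  -e 't hit' \
  -e 'b' \
  -e ':hit' \
  -e 's/$/ <- t fired/p')

# Variant 2: jump with the "t"/"T" pair.
# "t next" is taken (flag set) and clears the flag on the way, so
# "t hit" now only reflects the result of s/zzz//, which fails.
with_t=$(printf 'abc\n' | sed -n \
  -e 's/a/A/' \
  -e 't next' \
  -e 'T next' \
  -e ':next' \
  -e 's/zzz//' \
  -e 't hit' \
  -e 'b' \
  -e ':hit' \
  -e 's/$/ <- t fired/p')

printf 'with b:   [%s]\n' "$with_b"   # [Abc <- t fired]
printf 'with t/T: [%s]\n' "$with_t"   # []
```

The first variant prints a line it has no business printing, which is the same false positive the "ts" in the URL loop would hit after a plain "ba".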

-tim


--
You received this message from the "vim_use" maillist.
