On 06/12/2010 07:33 PM, fuzzylogic25 wrote:
> I am doing this through cygwin, running vim.
> It is, after all, a document that contains 2,560,213 lines.
For such a large source, you might consider doing this in sed
instead of vim. For your particular use-case (extracting URLs),
you might try the following:
1) save this as "http.sed"
-------------------8<----------------
# put all URL-ish things on their own line
# by putting a \n before/after the url
s<[hH][tT][tT][pP][sS]\?:\/\/[-a-zA-Z0-9~_!/=+?%:.,!*()@]\+<\n&\n<g
# if we didn't have a url in the string, continue
# by branching to the "e"nd
Te
# otherwise...
:a
# check if the current buffer starts with a url
s<^[hH][tT][tT][pP][sS]\?:\/\/[-a-zA-Z0-9~_!/=+?%:.,!*()@]\+\n<&<
# if it starts with a url, go to "s"uccess
ts
# otherwise nuke the stuff before a newline
s/^[^\n]*\n//
# if that succeeded, there's more to process
# so restart at "a"
ta
# otherwise, we're done with this line
# of input so branch to the "e"nd
be
:s
# success, so print it (up to the 1st newline)
P
# delete up to the first newline
s/^[^\n]*\n//
# "branch", but use the "t"/"T" to clear
# the test-success flag
ta
Ta
:e
-------------------8<----------------
2) invoke sed on your file with
sed -nf http.sed < yourfile.txt > urls_out.txt
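Before turning it loose on 2.5 million lines, it may be worth a
dry run on a couple of lines. The sketch below is only a smoke
test: it writes a comment-free copy of the script above into a
scratch directory and feeds it two made-up lines. The temp
directory, "sample.txt" and the example.* URLs are placeholders
of mine, not anything from your data, and it assumes GNU sed.

```shell
# scratch directory so your real http.sed is left alone (hypothetical paths)
tmp=$(mktemp -d)

# comment-free copy of the script from step 1
cat > "$tmp/http.sed" <<'EOF'
s<[hH][tT][tT][pP][sS]\?:\/\/[-a-zA-Z0-9~_!/=+?%:.,!*()@]\+<\n&\n<g
Te
:a
s<^[hH][tT][tT][pP][sS]\?:\/\/[-a-zA-Z0-9~_!/=+?%:.,!*()@]\+\n<&<
ts
s/^[^\n]*\n//
ta
be
:s
P
s/^[^\n]*\n//
ta
Ta
:e
EOF

# two made-up input lines: one with two URLs, one with none
printf '%s\n' \
  'see http://a.example/x and HTTPS://b.example/y?z=1 ok' \
  'no links on this line' > "$tmp/sample.txt"

# same invocation as step 2, just pointed at the sample
out=$(sed -nf "$tmp/http.sed" < "$tmp/sample.txt")
printf '%s\n' "$out"

rm -rf "$tmp"
```

If that prints the two URLs, one per line, and nothing for the
second input line, the script is behaving as described.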
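You might have to tweak the regexp for a URL, but the above is a
fairly loose approximation of a URL string beginning with
http/https (case insensitive). Note that the character class
includes "." and ",", so punctuation directly after a URL gets
captured along with it. Also, the "T" command is a GNU sed
extension, so this needs GNU sed (which is what cygwin provides).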
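If you do end up tweaking the regexp, one way to cross-check it
is to run roughly the same pattern through grep. This is a
different tool, not the sed approach above, and it assumes a grep
that supports "-o" (GNU grep does). The sample file and its
contents are made up for illustration.

```shell
# made-up sample input in a temp file
sample=$(mktemp)
printf '%s\n' \
  'see http://a.example/x and HTTPS://b.example/y?z=1 ok' \
  'no links on this line' > "$sample"

# -o prints only the matching part, one match per line
# -i ignores case, -E selects extended regexps (so + and ? need no backslash)
urls=$(grep -oiE 'https?://[-a-zA-Z0-9~_!/=+?%:.,*()@]+' "$sample")
printf '%s\n' "$urls"

rm -f "$sample"
```

Comparing this output against the sed output on the same sample
is a cheap way to spot a character you dropped from the class.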
I commented it pretty liberally with the "#" lines so you can see
what's going on. The only gotcha is the "ta"/"Ta" pair of
weirdnesses at the end of the ":s" block. If you just use "ba" to
jump unconditionally to the ":a" label, the successful "delete up
to the first newline" leaves the substitution-success flag set,
regardless of whether the next chunk starts with a URL, so the
next time through the loop the "ts" succeeds even when it
shouldn't. A "t" that is taken clears the flag, and the "Ta"
right behind it covers the case where the flag wasn't set, so
between them the pair always jumps to ":a" with the flag
cleared. Sed ain't pretty, but for your use case, it might save
you oodles of time & CPU.
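If you want to see the flag behaviour in isolation, here is a
stripped-down sketch (GNU sed assumed, since it uses "T"). The
labels "n" and "hit" and the "[stale]" marker are just names I
picked for the demo; they aren't part of the script above.

```shell
# "b" jumps without touching the flag set by s/a/X/, so the later
# "t hit" fires even though s/zzz// itself failed
with_b=$(printf 'abc\n' | sed -n \
  -e 's/a/X/' -e 'b n' \
  -e ':n' -e 's/zzz//' -e 't hit' -e 'p' -e 'b' \
  -e ':hit' -e 's/$/ [stale]/' -e 'p')

# the "t n"/"T n" pair always jumps too, but a taken "t" clears
# the flag first, so "t hit" is not fooled
with_t=$(printf 'abc\n' | sed -n \
  -e 's/a/X/' -e 't n' -e 'T n' \
  -e ':n' -e 's/zzz//' -e 't hit' -e 'p' -e 'b' \
  -e ':hit' -e 's/$/ [stale]/' -e 'p')

printf 'with b: %s\n' "$with_b"
printf 'with t: %s\n' "$with_t"
```

The first run tags the line with "[stale]" and the second
doesn't, which is the same effect the "ta"/"Ta" pair guards
against in the URL script.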
-tim