Re: remove an HTML tag and all its children from commandline

2010-01-31 Thread Zhang Weiwu
T o n g wrote: For not-so-simple tasks, you need not-so-simple tools. Depending on how much time you'd like to investigate into such not-so-simple tools, take a look at lib?, sgrep or the xpath language. Sure. libwww and sgrep are tools, while xpath is a language. I believe I should try

Re: remove an HTML tag and all its children from commandline

2010-01-31 Thread Zhang Weiwu
Steve Kemp wrote: You might enjoy my html-tool command which would do the job for you via: Thank you very much for mentioning this tool. At first glance this tool seems just too wonderful; it is designed to solve problems like mine. However, after I tried it, what I worry most

Re: remove an HTML tag and all its children from commandline

2010-01-31 Thread Steve Kemp
On Sun Jan 31, 2010 at 10:54:46 +0800, Zhang Weiwu wrote: I want to remove all advertisements in my 100 html files. They are pretty neatly classed, like the following: <div class=advertisement> ... </div> You might enjoy my html-tool command which would do the job for you via:

Re: remove an HTML tag and all its children from commandline

2010-01-31 Thread Zhang Weiwu
Zhang Weiwu wrote: Sure. libwww and sgrep are tools, while xpath is a language. I believe I should try xpath because I might use it in other places too, but what tool to use for xpath? Now I think I can answer my own question, partly at least. There is a good tool for xpath that is named

Re: remove an HTML tag and all its children from commandline

2010-01-31 Thread T o n g
On Sun, 31 Jan 2010 20:05:46 +0800, Zhang Weiwu wrote: $ tidy -q -asxml -utf8 page_07_zh.html | xpath -e '//d...@class=advertisement]' Exactly. Glad that you found both tidy and libxml-xpath-perl, and solved the problem yourself. -- Tong (remove underscore(s) to reply)
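
The tidy-plus-xpath pipeline above selects the advertisement blocks; to actually delete them, the same idea can be sketched in Python with only the standard library. This is a rough illustration, not what anyone in the thread used: `remove_by_class` is a hypothetical helper name, and it assumes the input is already well-formed XHTML (for example, the output of `tidy -asxml`) without undefined entities such as `&nbsp;`, which ElementTree would reject.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"
# Serialize XHTML elements without an "ns0:" prefix (changes global state).
ET.register_namespace("", XHTML_NS)


def remove_by_class(xml_text, tag="div", cls="advertisement"):
    """Hypothetical helper: drop every <tag class="... cls ..."> and its children."""
    root = ET.fromstring(xml_text)
    # ElementTree has no parent pointers, so walk each potential parent
    # and remove matching children from it.
    for parent in list(root.iter()):
        prev = None
        for child in list(parent):
            local_name = child.tag.rsplit("}", 1)[-1]  # strip "{namespace}"
            if local_name == tag and cls in child.get("class", "").split():
                # Text following the removed element belongs to the parent's
                # content, so move it to the previous sibling or the parent.
                if child.tail:
                    if prev is None:
                        parent.text = (parent.text or "") + child.tail
                    else:
                        prev.tail = (prev.tail or "") + child.tail
                parent.remove(child)
            else:
                prev = child
    return ET.tostring(root, encoding="unicode")
```

For 100 files, one would presumably run tidy on each file first and pass its output through this function; that loop is left out here.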

remove an HTML tag and all its children from commandline

2010-01-30 Thread Zhang Weiwu
Hello. I believe this is a common case and must have been discussed before on various other forums like awk/sed/regular expression groups. However, I could not find those discussions with Google. You would be helping me a lot if you simply pointed to a reference to a solution. I want to remove all advertisements in my

Re: remove an HTML tag and all its children from commandline

2010-01-30 Thread T o n g
On Sun, 31 Jan 2010 10:54:46 +0800, Zhang Weiwu wrote: I want to remove all advertisements in my 100 html files. They are pretty neatly classed, like the following: <div class=advertisement> ... </div> However I could not simply do this: s/<div class=advertisement>.*</div>// Because it is
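
A single regular expression like the one quoted above cannot tell which closing tag belongs to the advertisement div once other divs are nested inside it. One way around that, sketched here with Python's standard `html.parser`, is to count nesting depth while streaming through the page. The names `AdStripper` and `strip_ads` are made up for this illustration, and the sketch assumes the div tags inside an advertisement block are properly balanced; it is not a full HTML serializer.

```python
from html.parser import HTMLParser


class AdStripper(HTMLParser):
    """Hypothetical filter: copy HTML through, skipping <div class="advertisement"> blocks."""

    def __init__(self, tag="div", cls="advertisement"):
        super().__init__(convert_charrefs=False)
        self.tag, self.cls = tag, cls
        self.depth = 0  # > 0 while inside a block being removed
        self.out = []

    def handle_starttag(self, tag, attrs):
        if self.depth:
            if tag == self.tag:
                self.depth += 1  # nested element of the same kind
            return
        classes = (dict(attrs).get("class") or "").split()
        if tag == self.tag and self.cls in classes:
            self.depth = 1  # start skipping
            return
        self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        if not self.depth:
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if self.depth:
            if tag == self.tag:
                self.depth -= 1  # the matching close ends the skip at 0
            return
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.depth:
            self.out.append(data)

    def handle_entityref(self, name):
        if not self.depth:
            self.out.append(f"&{name};")

    def handle_charref(self, name):
        if not self.depth:
            self.out.append(f"&#{name};")

    def handle_comment(self, data):
        if not self.depth:
            self.out.append(f"<!--{data}-->")

    def handle_decl(self, decl):
        self.out.append(f"<!{decl}>")


def strip_ads(html):
    parser = AdStripper()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)
```

Because only the depth of same-named tags is tracked, unrelated tags inside the advertisement (paragraphs, images, links) are skipped without needing to balance them.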

Re: remove an HTML tag and all its children from commandline

2010-01-30 Thread Celejar
On Sun, 31 Jan 2010 10:54:46 +0800 Zhang Weiwu zhangwe...@realss.com wrote: ... I want to remove all advertisements in my 100 html files. They are pretty neatly classed, like the following: <div class=advertisement> ... </div> However I could not simply do this: s/div