With Nutch, all things are possible... To crawl and index videos on the general web, I would take the following steps.

1. Determine which video hosting services you want to index (e.g. YouTube, Vimeo, GodTube).
2. Review each hosting service for its standard embedding approach (e.g. iframe).
3. Review popular WordPress, vBulletin, and phpBB plugins for their standard embedding approaches. (This will give good coverage of the general web.)
4. Determine the common video embedding patterns used on the web (e.g. an iframe tag whose src is the video URL).
5. Build a video parser/index tool. The tool will need to handle iframe tags and their src attributes. The HTML parser and the Tika parser both have fairly straightforward code to use as a starting point for a new parser.
6. Test like crazy against predefined web pages/test pages.
7. Share the video parser/index patch on the Nutch JIRA or GitHub so others can review it, help out, and support open source software.

If you are planning on crawling the video hosting services directly, then you will need to work out the HTML structure of each hosting service and build a parser with that in mind.

Hope this helps.

jeff

On Fri, May 8, 2015 at 3:57 AM, Tizy Ninan <[email protected]> wrote:
> Hi,
>
> Is it possible to crawl the videos (ex. YouTube videos) embedded in a
> website? If so, what changes need to be made to enable video crawling. Will
> it be similar to crawling images?
>
> Kindly provide insights on this. Thanks in advance.
>
> Thanks and Regards,
> Tizy
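
P.S. As a rough illustration of the core of step 5 (finding iframe tags and reading their src), here is a minimal standalone Java sketch. It is not Nutch code and uses no Nutch API: the class name EmbeddedVideoFinder, the method findEmbeddedVideos, and the host list are all made up for the example, and a regex is only a stand-in for a real HTML parser. An actual plugin would do this work inside Nutch's parsing code on the parsed document instead.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Nutch: collects iframe src URLs that
// point at known video hosts.
final class EmbeddedVideoFinder {

    // Assumed host list for illustration; extend it per steps 1 to 4.
    private static final List<String> VIDEO_HOSTS =
            Arrays.asList("youtube.com", "youtube-nocookie.com", "vimeo.com");

    // Matches <iframe ... src="..."> or src='...'. A regex is a shortcut
    // here; it will miss unquoted attributes and other edge cases.
    private static final Pattern IFRAME_SRC = Pattern.compile(
            "<iframe\\b[^>]*?\\ssrc\\s*=\\s*([\"'])(.*?)\\1",
            Pattern.CASE_INSENSITIVE);

    static List<String> findEmbeddedVideos(String html) {
        List<String> urls = new ArrayList<>();
        Matcher m = IFRAME_SRC.matcher(html);
        while (m.find()) {
            String src = m.group(2).trim();
            if (isVideoHost(src)) {
                urls.add(src);
            }
        }
        return urls;
    }

    // True when the URL's host is one of VIDEO_HOSTS or a subdomain of one.
    private static boolean isVideoHost(String src) {
        String host;
        try {
            host = new URI(src).getHost();
        } catch (URISyntaxException e) {
            return false; // malformed src, skip it
        }
        if (host == null) {
            return false; // relative URL with no host
        }
        host = host.toLowerCase(Locale.ROOT);
        for (String known : VIDEO_HOSTS) {
            if (host.equals(known) || host.endsWith("." + known)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String page = "<iframe width=\"560\" "
                + "src=\"https://www.youtube.com/embed/abc123\"></iframe>";
        for (String url : findEmbeddedVideos(page)) {
            System.out.println(url);
        }
    }
}
```

Each URL returned this way could then be stored in its own index field, so the pages that embed a given video can be searched for.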

