With Nutch all things are possible......

To crawl/index videos on the general web, I would perform the following
steps.

1. Determine which video hosting services you want to index (e.g. YouTube,
Vimeo, GodTube, etc.)
2. Review each hosting service for its standard embedding approach (e.g.
iframe).
3. Review popular WordPress, VB, and phpBB plugins for their standard
embedding approaches. (This will provide good coverage of the general web.)
4. Determine the common video embedding models used on the web (iframe tag,
src=video url).
5. Build a video parser/index tool. The tool will need to handle iframe tags
and src attributes. The html parser and tika parser have pretty
straightforward code to build a new parser from.
6. Test like crazy with predefined web pages/test pages.
7. Share the video parser/index patch on the Nutch Jira or GitHub so others
can review/assist/support open source software.
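
To make step 5 a bit more concrete, here is a rough sketch of the
extraction logic only. It is plain Python using the standard library, not
Nutch code (a real Nutch parser would be a Java plugin), and the host list,
class name, and function names are all made up for illustration. It assumes
embeds show up as iframe/embed tags pointing at a known host, or as HTML5
video tags.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse

# Hypothetical host list for illustration; replace with the services from step 1.
VIDEO_HOSTS = {"youtube.com", "youtube-nocookie.com", "youtu.be", "vimeo.com"}


def is_video_host(url):
    """Return True if the URL's host is a listed host or a subdomain of one."""
    host = urlparse(url).hostname or ""
    return any(host == h or host.endswith("." + h) for h in VIDEO_HOSTS)


class VideoEmbedParser(HTMLParser):
    """Collects src URLs that look like embedded videos."""

    def __init__(self):
        super().__init__()
        self.video_urls = []
        self._in_video = False  # True while inside a <video> element

    def handle_starttag(self, tag, attrs):
        src = dict(attrs).get("src")
        if tag == "video":
            self._in_video = True
            if src:
                self.video_urls.append(src)
        elif tag == "source" and self._in_video and src:
            # <source> children of <video> carry the actual media URL.
            self.video_urls.append(src)
        elif tag in ("iframe", "embed") and src and is_video_host(src):
            # Only keep iframes/embeds that point at a known video host.
            self.video_urls.append(src)

    def handle_endtag(self, tag):
        if tag == "video":
            self._in_video = False


def extract_video_urls(html):
    """Parse an HTML string and return the video URLs found, in document order."""
    parser = VideoEmbedParser()
    parser.feed(html)
    return parser.video_urls
```

Calling extract_video_urls() on a fetched page returns the candidate video
URLs in document order. Relative URLs are returned as-is, so they would
still need to be resolved against the page URL before indexing.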

If you are planning on crawling the video hosting services directly, then
you will need to determine the HTML structure of each video hosting service
and build a parser with that in mind.

Hope this helps.....

jeff

On Fri, May 8, 2015 at 3:57 AM, Tizy Ninan <[email protected]> wrote:

> Hi,
>
> Is it possible to crawl the videos (ex. YouTube videos) embedded in a
> website? If so, what changes need to be made to enable video crawling. Will
> it be similar to crawling images?
>
> Kindly provide insights on this. Thanks in advance.
>
> Thanks and Regards,
> Tizy
>
