Both variants will work well (as long as your Kafka cluster can handle the full volume of the transmitted data for the duration of the retention (TTL) on each topic).
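For what it's worth, here is a rough sketch of what the Kafka-based bulk load could look like. This is only an illustration, assuming the Java producer API (org.apache.kafka.clients.producer), made-up broker addresses, and a hypothetical topic name "dw.table_updates" - not anything from your actual setup. The point is that the initial load uses exactly the same send path as the periodic incremental updates:

    // Rough sketch only: assumes the org.apache.kafka.clients.producer API,
    // fictitious broker addresses, and a hypothetical topic "dw.table_updates"
    // shared by the initial bulk load and the incremental updates.
    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    public class BulkLoadProducer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "broker1:9092,broker2:9092"); // assumed broker list
            props.put("acks", "all"); // favour durability over latency for a bulk load
            props.put("key.serializer",
                      "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer",
                      "org.apache.kafka.common.serialization.StringSerializer");

            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                // In a real load the rows would be streamed from the Teradata export;
                // this just shows that the send path is the same one the
                // incremental updates already use.
                String row = "pk=42|col_a=...|col_b=...";
                producer.send(new ProducerRecord<>("dw.table_updates", "42", row));
            }
        }
    }

If you go this way, just make sure the topic's retention is long enough (and the brokers have the disk for the table size times the replication factor) until the downstream consumers have caught up.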
I would run the whole thing through Kafka, since you will be "stress-testing" your production flow - consider: if you at some later time lost your destination tables, how would you then repopulate them? It would be nice to know that your normal flow handles this situation.

2014-10-23 12:08 GMT+02:00 Po Cheung <poche...@yahoo.com.invalid>:

> Hello,
>
> We are planning to set up a data pipeline and send periodic, incremental
> updates from Teradata to Hadoop via Kafka. For a large DW table with
> hundreds of GB of data, is it okay (in terms of performance) to use Kafka
> for the initial bulk data load? Or will Sqoop with Teradata connector be
> more appropriate?
>
>
> Thanks,
> Po