Re: Kafka stream message sampling

2016-05-16 Thread Samuel Zhou
Hi, Mich,

I created the Kafka DStream with following Java code:

sparkConf = new SparkConf().setAppName(this.getClass().getSimpleName() + ",
topic: " + topics);

jssc = new JavaStreamingContext(sparkConf, Durations.seconds(batchInterval
));

HashSet topicsSet = new HashSet<>(Arrays.asList(topics.split(",")));

HashMap kafkaParams = new HashMap<>();

kafkaParams.put("metadata.broker.list", brokers);

dStream = KafkaUtils.createDirectStream(jssc, byte[].class, byte[].class,
DefaultDecoder.class, DefaultDecoder.class, kafkaParams, topicsSet);

Do you know if there a way to do sampling for a stream when creating it?

Thanks,

Samuel

On Mon, May 16, 2016 at 12:54 AM, Mich Talebzadeh  wrote:

> Hi Samuel,
>
> How do you create your RDD based on Kakfa direct stream?
>
> Do you have your code snippet?
>
> HTH
>
>
>
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> On 15 May 2016 at 23:24, Samuel Zhou  wrote:
>
>> Hi,
>>
>> I was trying to use filter to sampling a Kafka direct stream, and the
>> filter function just take 1 messages from 10 by using hashcode % 10 == 0,
>> but the number of events of input for each batch didn't shrink to 10% of
>> original traffic. So I want to ask if there are any way to shrink the batch
>> size by a sampling function to save the traffic from Kafka?
>>
>> Thanks!
>> Samuel
>>
>
>


Re: Kafka stream message sampling

2016-05-16 Thread Mich Talebzadeh
Hi Samuel,

How do you create your RDD based on Kakfa direct stream?

Do you have your code snippet?

HTH





Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com



On 15 May 2016 at 23:24, Samuel Zhou  wrote:

> Hi,
>
> I was trying to use filter to sampling a Kafka direct stream, and the
> filter function just take 1 messages from 10 by using hashcode % 10 == 0,
> but the number of events of input for each batch didn't shrink to 10% of
> original traffic. So I want to ask if there are any way to shrink the batch
> size by a sampling function to save the traffic from Kafka?
>
> Thanks!
> Samuel
>