NullPointerException while joining two avro Hive tables

2017-02-04 Thread Понькин Алексей

Hi,

I have a table in Hive (the data is stored as Avro files).
Using the Python Spark shell, I am trying to join two datasets:

events = spark.sql('select * from mydb.events')

intersect = events.where('attr2 in (5,6,7) and attr1 in (1,2,3)')
intersect.count()

But I am constantly receiving the following exception:

java.lang.NullPointerException
at org.apache.hadoop.hive.serde2.avro.AvroObjectInspectorGenerator.supportedCategories(AvroObjectInspectorGenerator.java:142)
at org.apache.hadoop.hive.serde2.avro.AvroObjectInspectorGenerator.createObjectInspectorWorker(AvroObjectInspectorGenerator.java:91)
at org.apache.hadoop.hive.serde2.avro.AvroObjectInspectorGenerator.createObjectInspectorWorker(AvroObjectInspectorGenerator.java:104)
at org.apache.hadoop.hive.serde2.avro.AvroObjectInspectorGenerator.createObjectInspector(AvroObjectInspectorGenerator.java:83)
at org.apache.hadoop.hive.serde2.avro.AvroObjectInspectorGenerator.<init>(AvroObjectInspectorGenerator.java:56)
at org.apache.hadoop.hive.serde2.avro.AvroSerDe.initialize(AvroSerDe.java:124)
at org.apache.spark.sql.hive.HadoopTableReader$$anonfun$5$$anonfun$10.apply(TableReader.scala:251)
at org.apache.spark.sql.hive.HadoopTableReader$$anonfun$5$$anonfun$10.apply(TableReader.scala:239)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:766)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:766)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
at org.apache.spark.rdd.UnionRDD.compute(UnionRDD.scala:103)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
at org.apache.spark.scheduler.Task.run(Task.scala:85)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

Using Spark 2.0.0.2.5.0.0-1245

Any help will be appreciated
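
One thing I may try as a workaround is to read the underlying Avro files directly, bypassing the Hive Avro SerDe. A minimal spark-shell (Scala) sketch, assuming the external com.databricks:spark-avro package is on the classpath; the table location below is hypothetical:

// Hypothetical HDFS path of the table's Avro files; not from the original report.
val events = spark.read
  .format("com.databricks.spark.avro")
  .load("/apps/hive/warehouse/mydb.db/events")

val intersect = events.where("attr2 in (5,6,7) and attr1 in (1,2,3)")
intersect.count()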


-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: [KafkaRDD]: rdd.cache() does not seem to work

2016-01-12 Thread Понькин Алексей
Hi Charles,

I have created a very simplified job, https://github.com/ponkin/KafkaSnapshot, to
illustrate the problem:
https://github.com/ponkin/KafkaSnapshot/blob/master/src/main/scala/ru/ponkin/KafkaSnapshot.scala

In short, maybe the persist method is working, but not the way I expected.
I thought that Spark would fetch all data from the Kafka topic once and cache it in
memory; instead the RDD is recomputed every time I call the saveAsObjectFile method.
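
For clarity, here is a minimal sketch of what I expected to work: an action such as count() after cache() to materialize the RDD once, then multiple saves served from memory. The offset ranges, Kafka params and paths below are placeholders (not taken from the linked project), and the sketch assumes the data fits in executor memory:

import kafka.serializer.DefaultDecoder
import org.apache.spark.SparkContext
import org.apache.spark.streaming.kafka.{KafkaUtils, OffsetRange}

def snapshot(sc: SparkContext, kafkaParams: Map[String, String], range: Array[OffsetRange]): Unit = {
  val r = KafkaUtils.createRDD[Array[Byte], Array[Byte], DefaultDecoder, DefaultDecoder](
    sc, kafkaParams, range)
  r.cache()
  r.count()                                   // first action: reads Kafka once and fills the cache
  r.saveAsObjectFile("/tmp/kafka-snapshot/a") // subsequent actions are served from the cached blocks
  r.saveAsObjectFile("/tmp/kafka-snapshot/b") // (as long as they were not evicted)
}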



12.01.2016, 10:56, "charles li" :
> cache is just persist with the default storage level, and it is lazy [ nothing is
> cached yet ] until the first time the RDD is computed.
>
>
>
> On Tue, Jan 12, 2016 at 5:13 AM, ponkin  wrote:
>> Hi,
>>
>> Here is my use case:
>> I have a Kafka topic. The job is fairly simple - it reads the topic and saves data
>> to several HDFS paths.
>> I create rdd with the following code
>>  val r =  
>> KafkaUtils.createRDD[Array[Byte],Array[Byte],DefaultDecoder,DefaultDecoder](context,kafkaParams,range)
>> Then I am trying to cache that rdd with
>>  r.cache()
>> and then save this rdd to several hdfs locations.
>> But it seems that KafkaRDD is fetching data from kafka broker every time I 
>> call saveAsNewAPIHadoopFile.
>>
>> How can I cache data from Kafka in memory?
>>
>> P.S. When I do a repartition it seems to work properly (Kafka is read only
>> once), but Spark stores the shuffled data locally.
>> Is it possible to keep data in memory?
>>
>> 
>> View this message in context: [KafkaRDD]: rdd.cache() does not seem to work
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: [streaming] KafkaUtils.createDirectStream - how to start streaming from checkpoints?

2015-11-24 Thread Понькин Алексей
Great, thank you.
Sorry for being so inattentive; I need to read the docs more carefully.
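
For anyone who finds this thread later, a minimal sketch of the getOrCreate pattern described below; the checkpoint directory, broker list and topic are made-up values:

import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val checkpointDir = "hdfs:///checkpoints/my-job"

def createContext(): StreamingContext = {
  val conf = new SparkConf().setAppName("kafka-direct-with-checkpoints")
  val ssc = new StreamingContext(conf, Seconds(10))
  ssc.checkpoint(checkpointDir)
  val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")
  val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
    ssc, kafkaParams, Set("my-topic"))
  stream.foreachRDD(rdd => println(s"batch size: ${rdd.count()}"))
  ssc
}

// On a fresh start this builds a new context (offsets follow auto.offset.reset);
// on restart the context is recovered from the checkpoint, including stored offsets.
val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
ssc.start()
ssc.awaitTermination()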



24.11.2015, 15:15, "Deng Ching-Mallete" :
> Hi,
>
> If you wish to read from checkpoints, you need to use 
> StreamingContext.getOrCreate(checkpointDir, functionToCreateContext) to 
> create the streaming context that you pass in to 
> KafkaUtils.createDirectStream(...). You may refer to 
> http://spark.apache.org/docs/latest/streaming-programming-guide.html#checkpointing
>  for an example.
>
> HTH,
> Deng
>
> On Tue, Nov 24, 2015 at 5:46 PM, ponkin  wrote:
>> HI,
>>
>> When I create stream with KafkaUtils.createDirectStream I can explicitly 
>> define the position "largest" or "smallest" - where to read topic from.
>> What if I have previous checkpoints( in HDFS for example) with offsets, and 
>> I want to start reading from the last checkpoint?
>> In source code of KafkaUtils.createDirectStream I see the following
>>
>>  val reset = kafkaParams.get("auto.offset.reset").map(_.toLowerCase)
>>
>>      (for {
>>        topicPartitions <- kc.getPartitions(topics).right
>>        leaderOffsets <- (if (reset == Some("smallest")) {
>>          kc.getEarliestLeaderOffsets(topicPartitions)
>>        } else {
>>          kc.getLatestLeaderOffsets(topicPartitions)
>>        }).right
>> ...
>>
>> So it turns out that I have no option to start reading from
>> checkpoints (and offsets)?
>> Am I right?
>> How can I force Spark to start reading from the saved offsets (in checkpoints)?
>> Is it possible at all, or do I need to store the offsets in an external datastore?
>>
>> Alexey Ponkin
>>
>> 
>> View this message in context: [streaming] KafkaUtils.createDirectStream - 
>> how to start streaming from checkpoints?
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: [streaming] DStream with window performance issue

2015-09-09 Thread Понькин Алексей
That's correct, I have a 10 second batch.
The problem is actually the processing time: it is constantly increasing no
matter how small or large my window duration is.
I am trying to prepare some example code to clarify my use case.



09.09.2015, 17:04, "Cody Koeninger" <c...@koeninger.org>:
> It looked like from your graphs that you had a 10 second batch time, but that 
> your processing time was consistently 11 seconds.  If that's correct, then 
> yes your delay is going to keep growing.  You'd need to either increase your 
> batch time, or get your processing time down (either by adding more resources 
> or changing your code).
>
> I'd expect adding a repartition / shuffle to increase processing time, not 
> decrease it.  What are you seeing after adding the partitionBy call?
>
> On Tue, Sep 8, 2015 at 5:33 PM, Понькин Алексей <alexey.pon...@ya.ru> wrote:
>> Oh my, I implemented one directStream instead of a union of three, but it is
>> still growing exponentially with the window method.
>>
>>
>> 08.09.2015, 23:53, "Cody Koeninger" <c...@koeninger.org>:
>>
>>> Yeah, that's the general idea.
>>>
>>> When you say hard code topic name, do you mean  Set(topicA, topicB, topicB) 
>>> ?  You should be able to use a variable for that - read it from a config 
>>> file, whatever.
>>>
>>> If you're talking about the match statement, yeah you'd need to hardcode 
>>> your cases.
>>>
>>> On Tue, Sep 8, 2015 at 3:49 PM, Понькин Алексей <alexey.pon...@ya.ru> wrote:
>>>> Ok. I got it!
>>>> But it seems that I need to hard code topic name.
>>>>
>>>> something like that?
>>>>
>>>> val source = KafkaUtils.createDirectStream[Array[Byte], Array[Byte], 
>>>> DefaultDecoder, DefaultDecoder](
>>>>   ssc, kafkaParams, Set(topicA, topicB, topicB))
>>>>   .transform{ rdd =>
>>>>     val offsetRange = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
>>>>     rdd.mapPartitionsWithIndex(
>>>>       (idx: Int, itr: Iterator[(Array[Byte], Array[Byte])]) =>
>>>>         offsetRange(idx).topic match {
>>>>           case "topicA" => ...
>>>>           case "topicB" => ...
>>>>           case _ => 
>>>>         }
>>>>      )
>>>>     }
>>>>
>>>> 08.09.2015, 19:21, "Cody Koeninger" <c...@koeninger.org>:
>>>>> That doesn't really matter.  With the direct stream you'll get all 
>>>>> objects for a given topicpartition in the same spark partition.  You know 
>>>>> what topic it's from via hasOffsetRanges.  Then you can deserialize 
>>>>> appropriately based on topic.
>>>>>
>>>>> On Tue, Sep 8, 2015 at 11:16 AM, Понькин Алексей <alexey.pon...@ya.ru> 
>>>>> wrote:
>>>>>> The thing is, that these topics contain absolutely different AVRO 
>>>>>> objects(Array[Byte]) that I need to deserialize to different Java(Scala) 
>>>>>> objects, filter and then map to tuple (String, String). So i have 3 
>>>>>> streams with different avro object in there. I need to cast them(using 
>>>>>> some business rules) to pairs and unite.
>>>>>>
>>>>>>
>>>>>> 08.09.2015, 19:11, "Cody Koeninger" <c...@koeninger.org>:
>>>>>>
>>>>>>> I'm not 100% sure what's going on there, but why are you doing a union 
>>>>>>> in the first place?
>>>>>>>
>>>>>>> If you want multiple topics in a stream, just pass them all in the set 
>>>>>>> of topics to one call to createDirectStream
>>>>>>>
>>>>>>> On Tue, Sep 8, 2015 at 10:52 AM, Alexey Ponkin <alexey.pon...@ya.ru> 
>>>>>>> wrote:
>>>>>>>> Ok.
>>>>>>>> Spark 1.4.1 on yarn
>>>>>>>>
>>>>>>>> Here is my application
>>>>>>>> I have 4 different Kafka topics(different object streams)
>>>>>>>>

Re: [streaming] DStream with window performance issue

2015-09-08 Thread Понькин Алексей
Thank you very much for the great answer!



08.09.2015, 23:53, "Cody Koeninger" <c...@koeninger.org>:
> Yeah, that's the general idea.
>
> When you say hard code topic name, do you mean  Set(topicA, topicB, topicB) ? 
>  You should be able to use a variable for that - read it from a config file, 
> whatever.
>
> If you're talking about the match statement, yeah you'd need to hardcode your 
> cases.
>
> On Tue, Sep 8, 2015 at 3:49 PM, Понькин Алексей <alexey.pon...@ya.ru> wrote:
>> Ok. I got it!
>> But it seems that I need to hard code topic name.
>>
>> something like that?
>>
>> val source = KafkaUtils.createDirectStream[Array[Byte], Array[Byte], 
>> DefaultDecoder, DefaultDecoder](
>>   ssc, kafkaParams, Set(topicA, topicB, topicB))
>>   .transform{ rdd =>
>>     val offsetRange = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
>>     rdd.mapPartitionsWithIndex(
>>       (idx: Int, itr: Iterator[(Array[Byte], Array[Byte])]) =>
>>         offsetRange(idx).topic match {
>>           case "topicA" => ...
>>           case "topicB" => ...
>>           case _ => 
>>         }
>>      )
>>     }
>>
>> 08.09.2015, 19:21, "Cody Koeninger" <c...@koeninger.org>:
>>> That doesn't really matter.  With the direct stream you'll get all objects 
>>> for a given topicpartition in the same spark partition.  You know what 
>>> topic it's from via hasOffsetRanges.  Then you can deserialize 
>>> appropriately based on topic.
>>>
>>> On Tue, Sep 8, 2015 at 11:16 AM, Понькин Алексей <alexey.pon...@ya.ru> 
>>> wrote:
>>>> The thing is, that these topics contain absolutely different AVRO 
>>>> objects(Array[Byte]) that I need to deserialize to different Java(Scala) 
>>>> objects, filter and then map to tuple (String, String). So i have 3 
>>>> streams with different avro object in there. I need to cast them(using 
>>>> some business rules) to pairs and unite.
>>>>
>>>>
>>>> 08.09.2015, 19:11, "Cody Koeninger" <c...@koeninger.org>:
>>>>
>>>>> I'm not 100% sure what's going on there, but why are you doing a union in 
>>>>> the first place?
>>>>>
>>>>> If you want multiple topics in a stream, just pass them all in the set of 
>>>>> topics to one call to createDirectStream
>>>>>
>>>>> On Tue, Sep 8, 2015 at 10:52 AM, Alexey Ponkin <alexey.pon...@ya.ru> 
>>>>> wrote:
>>>>>> Ok.
>>>>>> Spark 1.4.1 on yarn
>>>>>>
>>>>>> Here is my application
>>>>>> I have 4 different Kafka topics(different object streams)
>>>>>>
>>>>>> type Edge = (String,String)
>>>>>>
>>>>>> val a = KafkaUtils.createDirectStream[...](sc,"A",params).filter( 
>>>>>> nonEmpty ).map( toEdge )
>>>>>> val b = KafkaUtils.createDirectStream[...](sc,"B",params).filter( 
>>>>>> nonEmpty ).map( toEdge )
>>>>>> val c = KafkaUtils.createDirectStream[...](sc,"C",params).filter( 
>>>>>> nonEmpty ).map( toEdge )
>>>>>>
>>>>>> val u = a union b union c
>>>>>>
>>>>>> val source = u.window(Seconds(600), Seconds(10))
>>>>>>
>>>>>> val z = KafkaUtils.createDirectStream[...](sc,"Z",params).filter( 
>>>>>> nonEmpty ).map( toEdge )
>>>>>>
>>>>>> val joinResult = source.rightOuterJoin( z )
>>>>>> joinResult.foreachRDD { rdd=>
>>>>>>   rdd.foreachPartition { partition =>
>>>>>>       // save to result topic in kafka
>>>>>>    }
>>>>>>  }
>>>>>>
>>>>>> The 'window' function in the code above is constantly growing,
>>>>>> no matter how many events appeared in corresponding kafka topics
>>>>>>
>>>>>> but if I change one line from
>>>>>>
>>>>>> val source = u.window(Seconds(600), Seconds(10))
>>>>>>
>>>>>> to
>>>>>>

Re: [streaming] DStream with window performance issue

2015-09-08 Thread Понькин Алексей
Oh my, I implemented one directStream instead of a union of three, but it is still
growing exponentially with the window method.



08.09.2015, 23:53, "Cody Koeninger" <c...@koeninger.org>:
> Yeah, that's the general idea.
>
> When you say hard code topic name, do you mean  Set(topicA, topicB, topicB) ? 
>  You should be able to use a variable for that - read it from a config file, 
> whatever.
>
> If you're talking about the match statement, yeah you'd need to hardcode your 
> cases.
>
> On Tue, Sep 8, 2015 at 3:49 PM, Понькин Алексей <alexey.pon...@ya.ru> wrote:
>> Ok. I got it!
>> But it seems that I need to hard code topic name.
>>
>> something like that?
>>
>> val source = KafkaUtils.createDirectStream[Array[Byte], Array[Byte], 
>> DefaultDecoder, DefaultDecoder](
>>   ssc, kafkaParams, Set(topicA, topicB, topicB))
>>   .transform{ rdd =>
>>     val offsetRange = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
>>     rdd.mapPartitionsWithIndex(
>>       (idx: Int, itr: Iterator[(Array[Byte], Array[Byte])]) =>
>>         offsetRange(idx).topic match {
>>           case "topicA" => ...
>>           case "topicB" => ...
>>           case _ => 
>>         }
>>      )
>>     }
>>
>> 08.09.2015, 19:21, "Cody Koeninger" <c...@koeninger.org>:
>>> That doesn't really matter.  With the direct stream you'll get all objects 
>>> for a given topicpartition in the same spark partition.  You know what 
>>> topic it's from via hasOffsetRanges.  Then you can deserialize 
>>> appropriately based on topic.
>>>
>>> On Tue, Sep 8, 2015 at 11:16 AM, Понькин Алексей <alexey.pon...@ya.ru> 
>>> wrote:
>>>> The thing is, that these topics contain absolutely different AVRO 
>>>> objects(Array[Byte]) that I need to deserialize to different Java(Scala) 
>>>> objects, filter and then map to tuple (String, String). So i have 3 
>>>> streams with different avro object in there. I need to cast them(using 
>>>> some business rules) to pairs and unite.
>>>>
>>>>
>>>> 08.09.2015, 19:11, "Cody Koeninger" <c...@koeninger.org>:
>>>>
>>>>> I'm not 100% sure what's going on there, but why are you doing a union in 
>>>>> the first place?
>>>>>
>>>>> If you want multiple topics in a stream, just pass them all in the set of 
>>>>> topics to one call to createDirectStream
>>>>>
>>>>> On Tue, Sep 8, 2015 at 10:52 AM, Alexey Ponkin <alexey.pon...@ya.ru> 
>>>>> wrote:
>>>>>> Ok.
>>>>>> Spark 1.4.1 on yarn
>>>>>>
>>>>>> Here is my application
>>>>>> I have 4 different Kafka topics(different object streams)
>>>>>>
>>>>>> type Edge = (String,String)
>>>>>>
>>>>>> val a = KafkaUtils.createDirectStream[...](sc,"A",params).filter( 
>>>>>> nonEmpty ).map( toEdge )
>>>>>> val b = KafkaUtils.createDirectStream[...](sc,"B",params).filter( 
>>>>>> nonEmpty ).map( toEdge )
>>>>>> val c = KafkaUtils.createDirectStream[...](sc,"C",params).filter( 
>>>>>> nonEmpty ).map( toEdge )
>>>>>>
>>>>>> val u = a union b union c
>>>>>>
>>>>>> val source = u.window(Seconds(600), Seconds(10))
>>>>>>
>>>>>> val z = KafkaUtils.createDirectStream[...](sc,"Z",params).filter( 
>>>>>> nonEmpty ).map( toEdge )
>>>>>>
>>>>>> val joinResult = source.rightOuterJoin( z )
>>>>>> joinResult.foreachRDD { rdd=>
>>>>>>   rdd.foreachPartition { partition =>
>>>>>>       // save to result topic in kafka
>>>>>>    }
>>>>>>  }
>>>>>>
>>>>>> The 'window' function in the code above is constantly growing,
>>>>>> no matter how many events appeared in corresponding kafka topics
>>>>>>
>>>>>> but if I change one line from
>>>>>>
>>>>>> val source = u.window(Seconds(600), Seconds(10))

Re: [streaming] DStream with window performance issue

2015-09-08 Thread Понькин Алексей
The thing is that these topics contain absolutely different Avro
objects (Array[Byte]) that I need to deserialize into different Java (Scala)
objects, filter, and then map to a tuple (String, String). So I have 3 streams,
each with a different Avro object in it. I need to convert them (using some business
rules) to pairs and unite them.



08.09.2015, 19:11, "Cody Koeninger" :
> I'm not 100% sure what's going on there, but why are you doing a union in the 
> first place?
>
> If you want multiple topics in a stream, just pass them all in the set of 
> topics to one call to createDirectStream
>
> On Tue, Sep 8, 2015 at 10:52 AM, Alexey Ponkin  wrote:
>> Ok.
>> Spark 1.4.1 on yarn
>>
>> Here is my application
>> I have 4 different Kafka topics(different object streams)
>>
>> type Edge = (String,String)
>>
>> val a = KafkaUtils.createDirectStream[...](sc,"A",params).filter( nonEmpty 
>> ).map( toEdge )
>> val b = KafkaUtils.createDirectStream[...](sc,"B",params).filter( nonEmpty 
>> ).map( toEdge )
>> val c = KafkaUtils.createDirectStream[...](sc,"C",params).filter( nonEmpty 
>> ).map( toEdge )
>>
>> val u = a union b union c
>>
>> val source = u.window(Seconds(600), Seconds(10))
>>
>> val z = KafkaUtils.createDirectStream[...](sc,"Z",params).filter( nonEmpty 
>> ).map( toEdge )
>>
>> val joinResult = source.rightOuterJoin( z )
>> joinResult.foreachRDD { rdd=>
>>   rdd.foreachPartition { partition =>
>>       // save to result topic in kafka
>>    }
>>  }
>>
>> The 'window' function in the code above is constantly growing,
>> no matter how many events appeared in corresponding kafka topics
>>
>> but if I change one line from
>>
>> val source = u.window(Seconds(600), Seconds(10))
>>
>> to
>>
>> val partitioner = ssc.sparkContext.broadcast(new HashPartitioner(8))
>>
>> val source = u.transform(_.partitionBy(partitioner.value) 
>> ).window(Seconds(600), Seconds(10))
>>
>> Everything works perfect.
>>
>> Perhaps the problem was in WindowedDStream
>>
>> I forced Spark to use a PartitionerAwareUnionRDD (by using partitionBy with the
>> same partitioner) instead of a UnionRDD.
>>
>> Nonetheless, I did not see any hints about such behaviour in the docs.
>> Is it a bug or absolutely normal behaviour?
>>
>> 08.09.2015, 17:03, "Cody Koeninger" :
>>
>>>  Can you provide more info (what version of spark, code example)?
>>>
>>>  On Tue, Sep 8, 2015 at 8:18 AM, Alexey Ponkin  wrote:
  Hi,

  I have an application with 2 streams, which are joined together.
  Stream1 is a simple DStream (relatively small batch chunks)
  Stream2 is a windowed DStream (with a duration of, for example, 60 seconds)

  Stream1 and Stream2 are Kafka direct stream.
  The problem is that, according to the logs, the window operation is constantly
  increasing (screen: http://en.zimagez.com/zimage/screenshot2015-09-0816-06-55.php).
  I also see a gap in the processing window (screen: http://en.zimagez.com/zimage/screenshot2015-09-0816-10-22.php);
  in the logs there are no events in that period.
  So what happens in that gap, and why is the window constantly increasing?

  Thank you in advance

  -
  To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
  For additional commands, e-mail: user-h...@spark.apache.org

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: [streaming] Using org.apache.spark.Logging will silently break task execution

2015-09-06 Thread Понькин Алексей
OK, I got it.
When I use the 'yarn logs -applicationId ' command, everything appears in the
right place.
Thank you!
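
For anyone who hits the same confusion, a minimal sketch of the driver/executor split that Gerard describes below; the logger and stream type are illustrative, not the original job's code:

import org.apache.log4j.Logger
import org.apache.spark.streaming.dstream.DStream

def logBothSides(source: DStream[String]): Unit = {
  source.foreachRDD { rdd =>
    // This closure runs on the driver: the message lands in the driver log.
    Logger.getLogger("MyJob").info("+++ForEachRDD+++")
    rdd.foreachPartition { partition =>
      // This closure runs on an executor: the message lands in that executor's log,
      // collected afterwards with `yarn logs -applicationId <appId>`.
      Logger.getLogger("MyJob").info(s"+++ForEachPartition+++ ${partition.size} records")
    }
  }
}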



07.09.2015, 01:44, "Gerard Maas" :
> You need to take into consideration 'where' things are executing. The closure 
> of the 'forEachRDD'  executes in the driver. Therefore, the log statements 
> printed during the execution of that part will be found in the driver logs.
> In contrast, the foreachPartition closure executes on the worker nodes. You 
> will find the '+++ForEachPartition+++' messages printed in the executor log.
>
> So both statements execute, but in different locations of the distributed 
> computing environment (aka cluster)
>
> -kr, Gerard.
>
> On Sun, Sep 6, 2015 at 10:53 PM, Alexey Ponkin  wrote:
>> Hi,
>>
>> I have the following code
>>
>> object MyJob extends org.apache.spark.Logging{
>> ...
>>  val source: DStream[SomeType] ...
>>
>>  source.foreachRDD { rdd =>
>>       logInfo(s"""+++ForEachRDD+++""")
>>       rdd.foreachPartition { partitionOfRecords =>
>>         logInfo(s"""+++ForEachPartition+++""")
>>       }
>>   }
>>
>> I was expecting to see both log messages in the job log.
>> But unfortunately you will never see the string '+++ForEachPartition+++' in the
>> logs, because the foreachPartition block will never execute.
>> And there is also no error message or anything in the logs.
>> I wonder, is this a bug or known behavior?
>> I know that org.apache.spark.Logging is a DeveloperApi, but why does it silently
>> fail with no messages?
>> What should be used instead of org.apache.spark.Logging in spark-streaming jobs?
>>
>> P.S. running spark 1.4.1 (on yarn)
>>
>> Thanks in advance
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: [spark-streaming] New directStream API reads topic's partitions sequentially. Why?

2015-09-05 Thread Понькин Алексей
Hi Cody,
Thank you for the quick response.
The problem was that my application did not have enough resources (all executors
were busy), so Spark decided to run these tasks sequentially. When I add more
executors to the application, everything goes fine.
Thank you anyway.
P.S. BTW, thank you for the great video lecture about directStream:
https://youtu.be/fXnNEq1v3VA
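
To make the fix concrete, here is a sketch of the kind of resource settings involved, assuming YARN and an 8-partition topic; the exact values are illustrative, not copied from my job:

// With the direct stream there is one Spark partition per Kafka partition,
// so an 8-partition topic needs at least 8 executor cores to be read in parallel.
val conf = new org.apache.spark.SparkConf()
  .setAppName("kafka-direct-parallel-read")
  .set("spark.executor.instances", "8") // equivalent to --num-executors 8 on YARN
  .set("spark.executor.cores", "1")     // equivalent to --executor-cores 1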



04.09.2015, 17:03, "Cody Koeninger" :
> The direct stream just makes a spark partition per kafka partition, so if 
> those partitions are not getting evenly distributed among executors, 
> something else is probably wrong with your configuration.
>
> If you replace the kafka stream with a dummy rdd created with e.g. 
> sc.parallelize, what happens?
>
> Also, are you running kafka on one of the yarn executors, or on a different 
> machine?
>
> On Fri, Sep 4, 2015 at 5:17 AM, ponkin  wrote:
>> Hi,
>> I am trying to read a Kafka topic with the new directStream method in KafkaUtils.
>> I have a Kafka topic with 8 partitions.
>> I am running a streaming job on YARN with 8 executors with 1 core each.
>> I noticed that Spark reads all of the topic's partitions in one executor
>> sequentially - this is obviously not what I want.
>> I want Spark to read all partitions in parallel.
>> How can I achieve that?
>>
>> Thank you, in advance.
>>
>> --
>> View this message in context: 
>> http://apache-spark-user-list.1001560.n3.nabble.com/spark-streaming-New-directStream-API-reads-topic-s-partitions-sequentially-Why-tp24577.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Spark Number of Partitions Recommendations

2015-08-02 Thread Понькин Алексей
Yes, I forgot to mention:
I chose a prime number as the modulus for the hash function because my keys are usually
strings, and Spark calculates the particular partition using the key hash (see
HashPartitioner.scala). So, to avoid a big number of collisions (when many keys
land in a few partitions), it is common to use a prime number as the modulus. But it
makes sense only for String keys, of course, because of the hash function. If you
have a different hash function for keys of a different type, you can use any other
modulus instead of a prime number.
I like this discussion on the topic:
http://stackoverflow.com/questions/1145217/why-should-hash-functions-use-a-prime-number-modulus
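
To make that concrete, a small sketch of the nextPrimeNumberAbove helper from the formula quoted below; the executor counts and the multiplier K are made-up example values:

// Smallest prime strictly greater than n (trial division is fine at this scale).
def nextPrimeNumberAbove(n: Int): Int = {
  def isPrime(x: Int): Boolean = x > 1 && (2 to math.sqrt(x).toInt).forall(x % _ != 0)
  Iterator.from(n + 1).find(isPrime).get
}

val numExecutors  = 8  // example value for --num-executors
val executorCores = 4  // example value for --executor-cores
val k             = 2  // multiplier, increased until join performance degrades
val numPartitions = nextPrimeNumberAbove(k * numExecutors * executorCores) // 67 here
// e.g. rdd.partitionBy(new org.apache.spark.HashPartitioner(numPartitions))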




02.08.2015, 00:14, Ruslan Dautkhanov dautkha...@gmail.com:
 You should also take into account the amount of memory that you plan to use.
 It's advised not to give too much memory to each executor .. otherwise GC
 overhead will go up.

 Btw, why prime numbers?

 --
 Ruslan Dautkhanov

 On Wed, Jul 29, 2015 at 3:31 AM, ponkin alexey.pon...@ya.ru wrote:
 Hi Rahul,

 Where did you see such a recommendation?
 I personally define partitions with the following formula

 partitions = nextPrimeNumberAbove( K*(--num-executors * --executor-cores ) )

 where
 nextPrimeNumberAbove(x) - prime number which is greater than x
 K - a multiplier; start with 1 and increase it until join
 performance starts to degrade

 --
 View this message in context: 
 http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Number-of-Partitions-Recommendations-tp24022p24059.html

 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org