The only error in the logs is the following, which happened over 10 days ago:
2014-02-22 01:41:27 b.s.d.nimbus [ERROR] Error when processing event
java.io.IOException: Unable to delete directory /mnt/storm/nimbus/stormdist/test-25-1393022928.
        at org.apache.commons.io.FileUtils.deleteDirectory(FileUtils.java:981) ~[commons-io-1.4.jar:1.4]
        at org.apache.commons.io.FileUtils.forceDelete(FileUtils.java:1381) ~[commons-io-1.4.jar:1.4]
        at backtype.storm.util$rmr.invoke(util.clj:442) ~[storm-core-0.9.0.1.jar:na]
        at backtype.storm.daemon.nimbus$do_cleanup.invoke(nimbus.clj:819) ~[storm-core-0.9.0.1.jar:na]
        at backtype.storm.daemon.nimbus$fn__5528$exec_fn__1229__auto____5529$fn__5534.invoke(nimbus.clj:896) ~[storm-core-0.9.0.1.jar:na]
        at backtype.storm.timer$schedule_recurring$this__3019.invoke(timer.clj:77) ~[storm-core-0.9.0.1.jar:na]
        at backtype.storm.timer$mk_timer$fn__3002$fn__3003.invoke(timer.clj:33) ~[storm-core-0.9.0.1.jar:na]
        at backtype.storm.timer$mk_timer$fn__3002.invoke(timer.clj:26) ~[storm-core-0.9.0.1.jar:na]
        at clojure.lang.AFn.run(AFn.java:24) ~[clojure-1.4.0.jar:na]
        at java.lang.Thread.run(Thread.java:701) ~[na:1.6.0_27]
It's fine, I can just rebuild the cluster; storm-deploy makes it pretty easy.
Thanks for your help on this!
As for my other question:
If my Trident batch interval is 500 ms and I keep max spout pending and the
batch size small enough, will I be able to get real-time results (i.e., sub-2
seconds)? I've played with the various parameters (I literally have a
spreadsheet mapping parameters to results) and haven't been able to get
there. Am I just doing it wrong? What would the key parameters be? The
complete latency is 500 ms, but Trident seems to be way behind despite none
of my bolts having a capacity > 0.6. This may have to do with nimbus being
throttled, so I will report back. But if there are people out there who have
done this kind of thing, I'd like to know if I'm missing an obvious
parameter or something.
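
For concreteness, here is roughly how I am setting those knobs. This is a
minimal sketch with placeholder values, not the exact numbers from my
spreadsheet:

    import backtype.storm.Config;

    Config conf = new Config();
    // how often Trident starts a new batch, in ms (stock config key)
    conf.put("topology.trident.batch.emit.interval.millis", 500);
    // cap on unacked batches in flight per spout; kept small for low latency
    conf.setMaxSpoutPending(2);
    // give slow batches time to finish before they are replayed
    conf.setMessageTimeoutSecs(30);
    conf.setNumWorkers(2);
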
Thanks,
S
On Sun, Mar 2, 2014 at 8:09 PM, Michael Rose <[email protected]> wrote:
> The fact that the process is being killed constantly is a red flag. Also,
> why are you running it as a client VM?
>
> Check your nimbus.log to see why it's restarting.
>
> Michael Rose (@Xorlev <https://twitter.com/xorlev>)
> Senior Platform Engineer, FullContact <http://www.fullcontact.com/>
> [email protected]
>
>
> On Sun, Mar 2, 2014 at 7:50 PM, Sean Solbak <[email protected]> wrote:
>
>> uintx ErgoHeapSizeLimit             = 0            {product}
>> uintx InitialHeapSize              := 27080896     {product}
>> uintx LargePageHeapSizeThreshold    = 134217728    {product}
>> uintx MaxHeapSize                  := 698351616    {product}
>>
>>
>> So the initial heap size is ~26 MB and the max is ~666 MB.
>>
>> It's a client process (not server; i.e., the command is "java -client
>> -Dstorm.options..."). The process gets killed and restarted continuously
>> with a new PID (which makes it tough to get stats on). I don't have
>> VisualVM, but if I run
>>
>> jstat -gc PID, I get
>>
>>  S0C    S1C    S0U    S1U      EC      EU       OC      OU       PC       PU   YGC   YGCT  FGC   FGCT    GCT
>> 832.0  832.0   0.0  352.9  7168.0  1115.9  17664.0  1796.0  21248.0  16029.6     5  0.268    0  0.000  0.268
>>
>> At this point I'll likely just rebuild the cluster. It's not in prod yet,
>> as I still need to tune it. I should have written 2 separate emails :)
>>
>> Thanks,
>> S
>>
>>
>>
>>
>> On Sun, Mar 2, 2014 at 7:10 PM, Michael Rose <[email protected]> wrote:
>>
>>> I'm not seeing too much to substantiate that. What size heap are you
>>> running, and is it nearly full? Perhaps attach VisualVM and check for GC
>>> activity.
>>>
>>> Michael Rose (@Xorlev <https://twitter.com/xorlev>)
>>> Senior Platform Engineer, FullContact <http://www.fullcontact.com/>
>>> [email protected]
>>>
>>>
>>> On Sun, Mar 2, 2014 at 6:54 PM, Sean Solbak <[email protected]> wrote:
>>>
>>>> Here it is. Appears to be some kind of race condition.
>>>>
>>>> http://pastebin.com/dANT8SQR
>>>>
>>>>
>>>> On Sun, Mar 2, 2014 at 6:42 PM, Michael Rose <[email protected]> wrote:
>>>>
>>>>> Can you do a thread dump and pastebin it? It's a nice first step to
>>>>> figure this out.
>>>>>
>>>>> I just checked on our Nimbus and while it's on a larger machine, it's
>>>>> using <1% CPU. Also look in your logs for any clues.
>>>>>
>>>>>
>>>>> Michael Rose (@Xorlev <https://twitter.com/xorlev>)
>>>>> Senior Platform Engineer, FullContact <http://www.fullcontact.com/>
>>>>> [email protected]
>>>>>
>>>>>
>>>>> On Sun, Mar 2, 2014 at 6:31 PM, Sean Solbak <[email protected]> wrote:
>>>>>
>>>>>> No, they are on separate machines. It's a 4-machine cluster: 2
>>>>>> workers, 1 nimbus, and 1 zookeeper.
>>>>>>
>>>>>> I suppose I could just create a new cluster, but I'd like to know why
>>>>>> this is occurring, to avoid future production outages.
>>>>>>
>>>>>> Thanks,
>>>>>> S
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Sun, Mar 2, 2014 at 6:19 PM, Michael Rose <[email protected]> wrote:
>>>>>>
>>>>>>> Are you running Zookeeper on the same machine as the Nimbus box?
>>>>>>>
>>>>>>> Michael Rose (@Xorlev <https://twitter.com/xorlev>)
>>>>>>> Senior Platform Engineer, FullContact <http://www.fullcontact.com/>
>>>>>>> [email protected]
>>>>>>>
>>>>>>>
>>>>>>> On Sun, Mar 2, 2014 at 6:16 PM, Sean Solbak <[email protected]> wrote:
>>>>>>>
>>>>>>>> This is the first step of 4. When I save to the db I'm actually saving
>>>>>>>> to a queue (just using the db for now). In the 2nd step we index the
>>>>>>>> data, and in the 3rd we do aggregation/counts for reporting. The last
>>>>>>>> is a search that I'm planning on using DRPC for. Within step 2 we pipe
>>>>>>>> certain datasets in real time to the clients they apply to. I'd like
>>>>>>>> this and the DRPC to be sub-2s, which should be reasonable.
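>>>>>>>>
>>>>>>>> For the search step, the plan is roughly the following. This is only a
>>>>>>>> sketch: ParseQuery and the searchIndex TridentState are placeholders
>>>>>>>> for whatever we end up building in steps 2/3.
>>>>>>>>
>>>>>>>> import backtype.storm.tuple.Fields;
>>>>>>>> import storm.trident.TridentState;
>>>>>>>> import storm.trident.TridentTopology;
>>>>>>>> import storm.trident.operation.builtin.MapGet;
>>>>>>>>
>>>>>>>> TridentTopology topology = new TridentTopology();
>>>>>>>> TridentState searchIndex = ...; // state built by the indexing stream (placeholder)
>>>>>>>> topology.newDRPCStream("search")
>>>>>>>>         .each(new Fields("args"), new ParseQuery(), new Fields("query"))
>>>>>>>>         .stateQuery(searchIndex, new Fields("query"), new MapGet(),
>>>>>>>>                 new Fields("results"));
>>>>>>>>
>>>>>>>> // client side: new DRPCClient("drpc-host", 3772).execute("search", "<terms>")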
>>>>>>>>
>>>>>>>> You're right that I could speed up step 1 by not using Trident, but
>>>>>>>> our requirements seem like a good use case for the other 3 steps. With
>>>>>>>> many results per second, batching should affect performance a ton if
>>>>>>>> the batch size is small enough.
>>>>>>>>
>>>>>>>> What would cause nimbus to be at 100% CPU with the topologies
>>>>>>>> killed?
>>>>>>>>
>>>>>>>> Sent from my iPhone
>>>>>>>>
>>>>>>>> On Mar 2, 2014, at 5:46 PM, Sean Allen <[email protected]>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>> Is there a reason you are using Trident?
>>>>>>>>
>>>>>>>> If you don't need to handle the events as a batch, you are probably
>>>>>>>> going to get better performance without it.
>>>>>>>>
>>>>>>>>
>>>>>>>> On Sun, Mar 2, 2014 at 2:23 PM, Sean Solbak <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> I'm writing a fairly basic Trident topology as follows:
>>>>>>>>>
>>>>>>>>> - 4 spouts of events
>>>>>>>>> - merges into one stream
>>>>>>>>> - serializes the object as an event in a string
>>>>>>>>> - saves to db
>>>>>>>>>
>>>>>>>>> I split the serialization task away from the spout, as it was CPU
>>>>>>>>> intensive, to speed it up.
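>>>>>>>>>
>>>>>>>>> In outline it looks something like this (a minimal sketch; the spout,
>>>>>>>>> function, and state class names are placeholders for ours):
>>>>>>>>>
>>>>>>>>> import backtype.storm.tuple.Fields;
>>>>>>>>> import storm.trident.Stream;
>>>>>>>>> import storm.trident.TridentTopology;
>>>>>>>>>
>>>>>>>>> TridentTopology topology = new TridentTopology();
>>>>>>>>> Stream merged = topology.merge(
>>>>>>>>>         topology.newStream("events-a", spoutA),
>>>>>>>>>         topology.newStream("events-b", spoutB),
>>>>>>>>>         topology.newStream("events-c", spoutC),
>>>>>>>>>         topology.newStream("events-d", spoutD));
>>>>>>>>> merged
>>>>>>>>>     // SerializeEvent is the CPU-heavy BaseFunction split out of the spout
>>>>>>>>>     .each(new Fields("event"), new SerializeEvent(), new Fields("json"))
>>>>>>>>>     .parallelismHint(4)
>>>>>>>>>     // DbStateFactory/DbUpdater wrap the db (queue) write
>>>>>>>>>     .partitionPersist(new DbStateFactory(), new Fields("json"), new DbUpdater());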
>>>>>>>>>
>>>>>>>>> The problem I have is that after 10 minutes there are over 910k
>>>>>>>>> tuples emitted/transferred, but only 193k records are saved.
>>>>>>>>>
>>>>>>>>> The overall load of the topology seems fine.
>>>>>>>>>
>>>>>>>>>    - 536.404 ms complete latency at the topology level
>>>>>>>>>    - The highest capacity of any bolt is 0.3, which is the
>>>>>>>>>    serialization one.
>>>>>>>>>    - Each bolt task has sub-20 ms execute latency and sub-40 ms
>>>>>>>>>    process latency.
>>>>>>>>>
>>>>>>>>> So it seems Trident has all the records internally, but I need
>>>>>>>>> these events as close to real time as possible.
>>>>>>>>>
>>>>>>>>> Does anyone have any guidance as to how to increase the
>>>>>>>>> throughput? Is it simply a matter of tweaking max spout pending and
>>>>>>>>> the batch size?
>>>>>>>>>
>>>>>>>>> I'm running it on 2 m1.smalls for now; I don't see the need to
>>>>>>>>> upgrade until the demand on the boxes is higher. However, CPU usage
>>>>>>>>> on the nimbus box is pinned at around 99%. Why would that be? It's
>>>>>>>>> at 99% even when all the topologies are killed.
>>>>>>>>>
>>>>>>>>> We are currently targeting 200 million records per day, which
>>>>>>>>> seems like it should be quite easy based on what I've read about
>>>>>>>>> what other people have achieved. I realize better hardware would
>>>>>>>>> help as well, but my first goal is to get Trident to push the
>>>>>>>>> records to the db more quickly.
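>>>>>>>>>
>>>>>>>>> One approach I'm looking at is doing a single bulk write per batch
>>>>>>>>> partition inside the state updater, roughly like the sketch below
>>>>>>>>> (DbState and its bulkInsert are placeholders for our actual db/queue
>>>>>>>>> client):
>>>>>>>>>
>>>>>>>>> import java.util.ArrayList;
>>>>>>>>> import java.util.List;
>>>>>>>>> import storm.trident.operation.TridentCollector;
>>>>>>>>> import storm.trident.state.BaseStateUpdater;
>>>>>>>>> import storm.trident.tuple.TridentTuple;
>>>>>>>>>
>>>>>>>>> public class DbUpdater extends BaseStateUpdater<DbState> {
>>>>>>>>>     @Override
>>>>>>>>>     public void updateState(DbState state, List<TridentTuple> tuples,
>>>>>>>>>             TridentCollector collector) {
>>>>>>>>>         // collect the whole partition's batch and write it in one call
>>>>>>>>>         List<String> rows = new ArrayList<String>();
>>>>>>>>>         for (TridentTuple t : tuples) {
>>>>>>>>>             rows.add(t.getStringByField("json"));
>>>>>>>>>         }
>>>>>>>>>         state.bulkInsert(rows); // placeholder bulk write on our State impl
>>>>>>>>>     }
>>>>>>>>> }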
>>>>>>>>>
>>>>>>>>> Thanks in advance,
>>>>>>>>> Sean
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>>
>>>>>>>> This is not a signature
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Thanks,
>>>>>>
>>>>>> Sean Solbak, BsC, MCSD
>>>>>> Solbak Technologies Inc.
>>>>>> 780.893.7326 (m)
>>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> Thanks,
>>>>
>>>> Sean Solbak, BsC, MCSD
>>>> Solbak Technologies Inc.
>>>> 780.893.7326 (m)
>>>>
>>>
>>>
>>
>>
>> --
>> Thanks,
>>
>> Sean Solbak, BsC, MCSD
>> Solbak Technologies Inc.
>> 780.893.7326 (m)
>>
>
>
--
Thanks,
Sean Solbak, BsC, MCSD
Solbak Technologies Inc.
780.893.7326 (m)