I'm a little swamp so it's hard to find time to write a blog post, but recently I have worked on an issue that's worth the time to blog about.
Here's a frustrating yet optimistic example of why writing code inside a third party complex system is hard.
Started with a hang
I have received a report that for one of the users Outlook froze with 100% CPU spike after about 3 days of regular usage.
Fortunately I was able to get two memory dump of outlook process at that time, but what I saw was not really encouraging.
The first one is disappointing. Using '!runaway' command I know that the thread of interest is 0 – the main thread of outlook:
0:000> kb ChildEBP RetAddr Args to Child 002efd60 759b06eb 002efe2c 00000000 00000000 user32!NtUserPeekMessage+0x15 002efd88 759b60dc 002efe2c 00000000 00000000 user32!_PeekMessage+0x88 *** ERROR: Symbol file could not be found. Defaulted to export symbols for MSO.DLL - 002efdb4 5b1d91fa 002efe2c 00000000 00000000 user32!PeekMessageA+0x143 *** ERROR: Symbol file could not be found. Defaulted to export symbols for OUTLOOK.EXE - WARNING: Stack unwind information not available. Following frames may be wrong. 002efddc 2f5ef04d 002efe2c 00000000 00000000 MSO!Ordinal7300+0x29 002efe04 2f6d032d 002efe2c 00000000 00000000 OUTLOOK!StdCoCreateInstance+0x73bec ... 0:000> dd 002efe2c 002efe2c 00151a6a 00000456 00161bd0 00000000 002efe3c 27ef61c0 0000076f 00000074 00000001
The main thread is just pumping the message loop and the current message is 0x456 which means it's a custom message, so not much to go with here…
The second memory dump is a little more interesting:
0:000> kb ChildEBP RetAddr Args to Child 002eeadc 759a61cf 75a10230 00000001 759a6cee kernel32!InterlockedIncrement+0xd 002eeae8 759a6cee 863e9520 312d6550 312d6548 user32!_BeginIfHookedUserApiHook+0xf 002eeb4c 759a77c4 00000000 30462bf4 00151a6a user32!UserCallWinProcCheckWow+0x6f 002eebac 759a788a 30462bf4 00000000 002eebcc user32!DispatchMessageWorker+0x3bc 002eebbc 304d5f08 312d6548 312d6548 002eec00 user32!DispatchMessageW+0xf WARNING: Stack unwind information not available. Following frames may be wrong. 002eebcc 30d487dd 312d6548 00000000 00000000 WWLIB !FMain+0xd1961 002eec00 30a24569 002efaac 00000000 00000001 WWLIB!wdCommandDispatch+0x3d346a 002efaa4 30bfcaf7 0000010f 00000056 00000000 WWLIB!wdCommandDispatch+0xaf1f6 ... 0:000> dd 312d6548 312d6548 00151a6a 00000456 00360c04 00000000 312d6558 27f1c053 00000814 00000033 00000000
Here we can see the main thread is pushing a message into the message loop, and it's the same message code! So my assumption was that the 100% CPU is caused by a loop between OUTLOOK and WWLIB playing around the message loop. Investigating into who is WWLIB revealed that it's the word rich text editor that outlook is using in it's inspectors (like new email windows, post item, etc.) and looking at the logs I realized that the hang occurred after an inspector window was closed, but not right after, another inspector window needs to be open for the hang to manifest.
We have a lead!
Continued with a crash
A day or two after the hang report our QA had managed to crash Outlook by performing an operation that resulted in closing the active inspector window (I'm jumping a little as the first analysis is not very interesting). A quick look into "Event Viewer" showed a familiar faulting module 'WWLIB' but process crash dump revealed it was "Access Violation" so no much help there.
But the important part, that changed everything, it was reproducible!
So I started playing around to pin-point what was causing the crash and not very surprisingly I have managed to reproduce the hang as well. Not sure what exactly causes the hang to became a crash, it has something to do with the folder or the item currently selected, but it's not really the point here.
After a short time I stripped the code to the bare minimum that still resulted in the hang:
1. Outlook 2007 (2003 doesn't have WWLIB, 2010 doesn't repro)
2. Have process wide keyboard hook using 'SetWindowsHookEx' [ref 1] [ref 2]
3. Hook Ctrl+Enter keyboard combination event to close the current active inspector AND don't continue the event (CallNextHookEx)
1. Open outlook
2. Open new mail inspector
3. Set focus on the body word rich text editor of the new mail
4. Type 'a' and QUICKLY hit Ctrl+Enter
5. Open another new mail inspector (not really important, just for completeness)
1. It happens only with Ctrl+Enter, hooking any other key combinations does not result in hang.
2. It happens only if before pressing Ctrl+Enter there was a keyboard event that resulted in change to the characters appearing in the editor (adding/deleting), but the delay between must not be above 2-3 sec, after that everything fine.
3. The code must not propagate the keyboard event to the next handler, if does it will not result in hang.
On Ctrl+Enter keyboard event in the hook don't close the inspector directly but delay its closing by posting the close command to the main thread after a short delay, meanwhile propagate the keyboard event to the next hook to let the word control 'handle' it.
- Couldn't have identified the real reason for the issue without a consistent repro.
- It's very important to understand what happened before process hang/crash as it's not always textbook divide by zero error.
- Having memory dumps didn't really help to identify the core issue but helped to verify it was related.
- The code, basically, does nothing wrong, there is no way a developer could have prevented it.
- Developing Outlook add-in is hard.