Monday, December 10, 2007

A day and a half in the life...

This is a story about Windows, Samba, debugging, and frustration. I just spent a day and a half debugging a very annoying performance problem with a brand-new system.

Background: I just recently bought a new desktop computer, which I do about every two years. Computers should last longer than that, but I find that, even with good "hygeine", Windows systems tend to decay to the point where they exhibit weird behaviors after about two years, for which the cure is a complete reinstall of the OS and all applications. (Pause for Mac fanboys to snicker.) The "rebuild the world" process wouldn't be so bad if it weren't so hard to migrate all one's data -- even given the fact that a lot of my data is already in subversion. The real problem is that each application sprays its configuration and data randomly around your system, whether in the program install folder, registry, documents and settings folder, local settings folder, etc.

So, I bought a new computer, an extra-quiet one from www.endpcnoise.com. These guys specialize in quiet systems, and since I work in a home environment, the computer is usually the noisiest thing in the room. (Of particular annoyance is the variable speed fan on my existng Dell system, which, every time the system worked up a sweat, made a whiny noise. And we was fined $50 and had to pick up the garbage in the snow, but thats not what I came to tell you about.) Pretty happy with the new system overall.

So, I installed XP Pro on the new system, and proceeded to install all my applications, utilities, and all kinds of groovy things that we were talking about on the bench. And then I got to the part where I tried to use it; specifically, tried to fire up IDEA and build the project I'm working on.

Now, I've got a somewhat weird setup; when developing on a project, I checkout a workspace on my Linux server, which is served via SAMBA to my local network, and I run the IDE on my Windows desktop and point the IDE at my samba share. There's a measurable performance hit vs local, but I like being able to do some things from the Linux command line and other things from the IDE, so overall its a more productive setup for me.

When you ask the IDE to "make" the project, it crawls the files in the project checking their modification times. An "empty" make on a project with ~1000 files generally takes a few seconds to figure out that there's nothing to do. But when I set it up, the new (faster) system took about 30-60s to do an empty make.

OK, it's debugging time. What's different about the two systems? Same OS (XP Pro), same service pack level, both systems up to date on patches, same Java version, same IDE version, same user credentials, no host-specific information in my SAMBA config. Different hardware. Make sure my ethernet drivers are up-to-date. Test network for errors, swap cables, all that. Run IOMeter, found that both get similar throughput for large files on the same SAMBA share.

Crank up perfmon, which tells me that the new system is sending out more packets for a make than the old one. OK, crank up ethereal, get a packet capture, and find that the new system is sending/receiving 10x as many packets for the same operation:

[brian@brian-server ~]$ wc -l /tmp/*cap
258204 /tmp/new-cap
17719 /tmp/old-cap


So, what's the difference? Let's look at the packet capture. In the old trace, for each file being probed, it did something like this:

0.467895 192.168.1.104 -> 192.168.1.107 SMB Trans2 Request, QUERY_PATH_INFO, Query File Basic Info, Path: \work\openjfx-compiler\classes\production\openjfx-compiler\com\sun\javafx\api\tree\OnDeleteElementTree.class
0.468041 192.168.1.107 -> 192.168.1.104 SMB Trans2 Response, QUERY_PATH_INFO
0.468283 192.168.1.104 -> 192.168.1.107 SMB Trans2 Request, QUERY_PATH_INFO, Query File Network Open Info, Path: \work\openjfx-compiler\classes\production\openjfx-compiler\com\sun\javafx\api\tree\OnDeleteElementTree.class
0.468402 192.168.1.107 -> 192.168.1.104 SMB Trans2 Response, QUERY_PATH_INFO


Two requests, two responses per file. Seemed reasonable. On the new system, for each file:

2.010471 192.168.1.113 -> 192.168.1.107
SMB NT Create AndX Request, Path: \work\openjfx-compiler\classes\production\openjfx-compiler\com\sun\javafx\api\tree
2.010698 192.168.1.107 -> 192.168.1.113 SMB NT Create AndX Response, FID: 0x2754
2.010900 192.168.1.113 -> 192.168.1.107 SMB NT Create AndX Request, Path: \
2.011011 192.168.1.107 -> 192.168.1.113 SMB NT Create AndX Response, FID: 0x2755
2.011237 192.168.1.113 -> 192.168.1.107 SMB Trans2 Request, FIND_FIRST2, Pattern: \work
2.011570 192.168.1.107 -> 192.168.1.113 SMB Trans2 Response, FIND_FIRST2, Files: work
2.011752 192.168.1.113 -> 192.168.1.107 SMB Close Request, FID: 0x2755
2.011833 192.168.1.107 -> 192.168.1.113 SMB Close Response
2.012025 192.168.1.113 -> 192.168.1.107 SMB NT Create AndX Request, Path: \work\openjfx-compiler
2.012157 192.168.1.107 -> 192.168.1.113 SMB NT Create AndX Response, FID: 0x2756
2.012353 192.168.1.113 -> 192.168.1.107 SMB Trans2 Request, FIND_FIRST2, Pattern: \work\openjfx-compiler\classes
2.012631 192.168.1.107 -> 192.168.1.113 SMB Trans2 Response, FIND_FIRST2, Files: classes
2.012796 192.168.1.113 -> 192.168.1.107 SMB Close Request, FID: 0x2756
2.012897 192.168.1.107 -> 192.168.1.113 SMB Close Response
2.013100 192.168.1.113 -> 192.168.1.107 SMB NT Create AndX Request, Path: \work\openjfx-compiler\classes\production\openjfx-compiler
2.013239 192.168.1.107 -> 192.168.1.113 SMB NT Create AndX Response, FID: 0x2757
2.013445 192.168.1.113 -> 192.168.1.107 SMB Trans2 Request, FIND_FIRST2, Pattern: \work\openjfx-compiler\classes\production\openjfx-compiler\com
2.013894 192.168.1.107 -> 192.168.1.113 SMB Trans2 Response, FIND_FIRST2, Files: com
2.014095 192.168.1.113 -> 192.168.1.107 SMB Close Request, FID: 0x2757
2.014174 192.168.1.107 -> 192.168.1.113 SMB Close Response
2.014355 192.168.1.113 -> 192.168.1.107 SMB NT Create AndX Request, Path: \work\openjfx-compiler\classes\production\openjfx-compiler\com
2.014504 192.168.1.107 -> 192.168.1.113 SMB NT Create AndX Response, FID: 0x2758
2.014962 192.168.1.113 -> 192.168.1.107 SMB Trans2 Request, FIND_FIRST2, Pattern: \work\openjfx-compiler\classes\production\openjfx-compiler\com\sun
2.015169 192.168.1.107 -> 192.168.1.113 SMB Trans2 Response, FIND_FIRST2, Files: sun
2.015339 192.168.1.113 -> 192.168.1.107 SMB Close Request, FID: 0x2758
2.015428 192.168.1.107 -> 192.168.1.113 SMB Close Response
2.015633 192.168.1.113 -> 192.168.1.107 SMB NT Create AndX Request, Path: \work\openjfx-compiler\classes\production\openjfx-compiler\com\sun
2.015764 192.168.1.107 -> 192.168.1.113 SMB NT Create AndX Response, FID: 0x2759
2.015980 192.168.1.113 -> 192.168.1.107 SMB Trans2 Request, FIND_FIRST2, Pattern: \work\openjfx-compiler\classes\production\openjfx-compiler\com\sun\javafx
2.016221 192.168.1.107 -> 192.168.1.113 SMB Trans2 Response, FIND_FIRST2, Files: javafx
2.016402 192.168.1.113 -> 192.168.1.107 SMB Close Request, FID: 0x2759
2.016493 192.168.1.107 -> 192.168.1.113 SMB Close Response
2.016693 192.168.1.113 -> 192.168.1.107 SMB NT Create AndX Request, Path: \work\openjfx-compiler\classes\production\openjfx-compiler\com\sun\javafx
2.016827 192.168.1.107 -> 192.168.1.113 SMB NT Create AndX Response, FID: 0x275a
2.017096 192.168.1.113 -> 192.168.1.107 SMB Trans2 Request, FIND_FIRST2, Pattern: \work\openjfx-compiler\classes\production\openjfx-compiler\com\sun\javafx\api
2.017348 192.168.1.107 -> 192.168.1.113 SMB Trans2 Response, FIND_FIRST2, Files: api
2.017520 192.168.1.113 -> 192.168.1.107 SMB Close Request, FID: 0x275a
2.017590 192.168.1.107 -> 192.168.1.113 SMB Close Response
2.017803 192.168.1.113 -> 192.168.1.107 SMB NT Create AndX Request, Path: \work\openjfx-compiler\classes\production\openjfx-compiler\com\sun\javafx\api
2.017919 192.168.1.107 -> 192.168.1.113 SMB NT Create AndX Response, FID: 0x275b
2.018133 192.168.1.113 -> 192.168.1.107 SMB Trans2 Request, FIND_FIRST2, Pattern: \work\openjfx-compiler\classes\production\openjfx-compiler\com\sun\javafx\api\tree
2.018389 192.168.1.107 -> 192.168.1.113 SMB Trans2 Response, FIND_FIRST2, Files: tree
2.018547 192.168.1.113 -> 192.168.1.107 SMB Close Request, FID: 0x275b
2.018626 192.168.1.107 -> 192.168.1.113 SMB Close Response
2.018779 192.168.1.113 -> 192.168.1.107 SMB Close Request, FID: 0x2754
2.018851 192.168.1.107 -> 192.168.1.113 SMB Close Response
2.019157 192.168.1.113 -> 192.168.1.107 SMB Trans2 Request, QUERY_PATH_INFO, Query File Basic Info, Path: \work\openjfx-compiler\classes\production\openjfx-compiler\com\sun\javafx\api\tree\OnDeleteElementTree.class
2.019292 192.168.1.107 -> 192.168.1.113 SMB Trans2 Response, QUERY_PATH_INFO
2.019495 192.168.1.113 -> 192.168.1.107 SMB Trans2 Request, QUERY_PATH_INFO, Query File Standard Info, Path: \work\openjfx-compiler\classes\production\openjfx-compiler\com\sun\javafx\api\tree\OnDeleteElementTree.class
2.019613 192.168.1.107 -> 192.168.1.113 SMB Trans2 Response, QUERY_PATH_INFO
2.019832 192.168.1.113 -> 192.168.1.107 SMB Trans2 Request, QUERY_PATH_INFO, Query File Internal Info, Path: \work\openjfx-compiler\classes\production\openjfx-compiler\com\sun\javafx\api\tree\OnDeleteElementTree.class
2.019960 192.168.1.107 -> 192.168.1.113 SMB Trans2 Response, QUERY_PATH_INFO
2.020206 192.168.1.113 -> 192.168.1.107 SMB Trans2 Request, QUERY_PATH_INFO, Query File Network Open Info, Path: \work\openjfx-compiler\classes\production\openjfx-compiler\com\sun\javafx\api\tree\OnDeleteElementTree.class
2.020316 192.168.1.107 -> 192.168.1.113 SMB Trans2 Response, QUERY_PATH_INFO


For those of you who don't enjoy reading packet dumps (I hope that's all of you), what's going on here is that it does some sort of complicated multipacket transaction for each level of directory from the project root down to the last directory in the chain, and then does four request-responses for each file. And it repeats the directory stuff for every file, even though it just asked that.

OK, more debugging -- what could cause a system to deviate from the standard file system client behavior? Check all the network control panel settings, they're all the same. Spend several hours googling through the MS knowledge base for file sharing related problems, look at the various registry keys and file versions mentioned, nope, none of them are helpful. Google for people who have had similar problems. Many have, but no one reported a solution that works, except one person, who mentioned that their network behavior changed when they changed versions of Symantec antivirus. Well, I don't run Symantec AV, but I do run ZoneAlarm. And I do have different versions -- ZA Antivirus on the old system, ZA Suite on the new. Seems like a small difference -- they're clearly built on the same base technology -- but lets try it. Disabled ZA, reboot, make sure its not running, and run my IDE again -- no change. Still an annoying 30-60s delay before it figures out there's nothing to rebuild.

At this point, I asked my friends for help. Lots of sympathy. Lots of "check this, check that", but very little advice that actually moved me towards a solution (sorry, guys).

That post about the guy with the Symantec problem gnawed at me, though. I know security programs intercept a lot of network traffic, so the theory was perfectly plausible, and the best theory I had so far. I did the "disable ZA" thing again, rebooted, and cranked up Rootkit Hook Analyzer to see if ZA still had anything hooked, and it did, even though there were no ZA processes running and the ZA TrueVector service was stopped. So, I ran the uninstaller for ZA Suite, rebooted, checked with RHA to see that everything was unhooked (it was), and ran my IDE test again -- and this time, sweet success!

So, the conclusion is that the ZA Suite interferes with file sharing client behavior in a rather fundamental way (but one which only has a noticeable affect when dealing with lots of small files).

So, my system is temporarily defenseless against malware while I decide what to do. Why on earth would ZA rewrite the file system client packet stream like that? I want to send them a bill for that day and a half.