Monday, December 10, 2007

A day and a half in the life...

This is a story about Windows, Samba, debugging, and frustration. I just spent a day and a half debugging a very annoying performance problem with a brand-new system.

Background: I just recently bought a new desktop computer, which I do about every two years. Computers should last longer than that, but I find that, even with good "hygeine", Windows systems tend to decay to the point where they exhibit weird behaviors after about two years, for which the cure is a complete reinstall of the OS and all applications. (Pause for Mac fanboys to snicker.) The "rebuild the world" process wouldn't be so bad if it weren't so hard to migrate all one's data -- even given the fact that a lot of my data is already in subversion. The real problem is that each application sprays its configuration and data randomly around your system, whether in the program install folder, registry, documents and settings folder, local settings folder, etc.

So, I bought a new computer, an extra-quiet one from www.endpcnoise.com. These guys specialize in quiet systems, and since I work in a home environment, the computer is usually the noisiest thing in the room. (Of particular annoyance is the variable speed fan on my existng Dell system, which, every time the system worked up a sweat, made a whiny noise. And we was fined $50 and had to pick up the garbage in the snow, but thats not what I came to tell you about.) Pretty happy with the new system overall.

So, I installed XP Pro on the new system, and proceeded to install all my applications, utilities, and all kinds of groovy things that we were talking about on the bench. And then I got to the part where I tried to use it; specifically, tried to fire up IDEA and build the project I'm working on.

Now, I've got a somewhat weird setup; when developing on a project, I checkout a workspace on my Linux server, which is served via SAMBA to my local network, and I run the IDE on my Windows desktop and point the IDE at my samba share. There's a measurable performance hit vs local, but I like being able to do some things from the Linux command line and other things from the IDE, so overall its a more productive setup for me.

When you ask the IDE to "make" the project, it crawls the files in the project checking their modification times. An "empty" make on a project with ~1000 files generally takes a few seconds to figure out that there's nothing to do. But when I set it up, the new (faster) system took about 30-60s to do an empty make.

OK, it's debugging time. What's different about the two systems? Same OS (XP Pro), same service pack level, both systems up to date on patches, same Java version, same IDE version, same user credentials, no host-specific information in my SAMBA config. Different hardware. Make sure my ethernet drivers are up-to-date. Test network for errors, swap cables, all that. Run IOMeter, found that both get similar throughput for large files on the same SAMBA share.

Crank up perfmon, which tells me that the new system is sending out more packets for a make than the old one. OK, crank up ethereal, get a packet capture, and find that the new system is sending/receiving 10x as many packets for the same operation:

[brian@brian-server ~]$ wc -l /tmp/*cap
258204 /tmp/new-cap
17719 /tmp/old-cap


So, what's the difference? Let's look at the packet capture. In the old trace, for each file being probed, it did something like this:

0.467895 192.168.1.104 -> 192.168.1.107 SMB Trans2 Request, QUERY_PATH_INFO, Query File Basic Info, Path: \work\openjfx-compiler\classes\production\openjfx-compiler\com\sun\javafx\api\tree\OnDeleteElementTree.class
0.468041 192.168.1.107 -> 192.168.1.104 SMB Trans2 Response, QUERY_PATH_INFO
0.468283 192.168.1.104 -> 192.168.1.107 SMB Trans2 Request, QUERY_PATH_INFO, Query File Network Open Info, Path: \work\openjfx-compiler\classes\production\openjfx-compiler\com\sun\javafx\api\tree\OnDeleteElementTree.class
0.468402 192.168.1.107 -> 192.168.1.104 SMB Trans2 Response, QUERY_PATH_INFO


Two requests, two responses per file. Seemed reasonable. On the new system, for each file:

2.010471 192.168.1.113 -> 192.168.1.107
SMB NT Create AndX Request, Path: \work\openjfx-compiler\classes\production\openjfx-compiler\com\sun\javafx\api\tree
2.010698 192.168.1.107 -> 192.168.1.113 SMB NT Create AndX Response, FID: 0x2754
2.010900 192.168.1.113 -> 192.168.1.107 SMB NT Create AndX Request, Path: \
2.011011 192.168.1.107 -> 192.168.1.113 SMB NT Create AndX Response, FID: 0x2755
2.011237 192.168.1.113 -> 192.168.1.107 SMB Trans2 Request, FIND_FIRST2, Pattern: \work
2.011570 192.168.1.107 -> 192.168.1.113 SMB Trans2 Response, FIND_FIRST2, Files: work
2.011752 192.168.1.113 -> 192.168.1.107 SMB Close Request, FID: 0x2755
2.011833 192.168.1.107 -> 192.168.1.113 SMB Close Response
2.012025 192.168.1.113 -> 192.168.1.107 SMB NT Create AndX Request, Path: \work\openjfx-compiler
2.012157 192.168.1.107 -> 192.168.1.113 SMB NT Create AndX Response, FID: 0x2756
2.012353 192.168.1.113 -> 192.168.1.107 SMB Trans2 Request, FIND_FIRST2, Pattern: \work\openjfx-compiler\classes
2.012631 192.168.1.107 -> 192.168.1.113 SMB Trans2 Response, FIND_FIRST2, Files: classes
2.012796 192.168.1.113 -> 192.168.1.107 SMB Close Request, FID: 0x2756
2.012897 192.168.1.107 -> 192.168.1.113 SMB Close Response
2.013100 192.168.1.113 -> 192.168.1.107 SMB NT Create AndX Request, Path: \work\openjfx-compiler\classes\production\openjfx-compiler
2.013239 192.168.1.107 -> 192.168.1.113 SMB NT Create AndX Response, FID: 0x2757
2.013445 192.168.1.113 -> 192.168.1.107 SMB Trans2 Request, FIND_FIRST2, Pattern: \work\openjfx-compiler\classes\production\openjfx-compiler\com
2.013894 192.168.1.107 -> 192.168.1.113 SMB Trans2 Response, FIND_FIRST2, Files: com
2.014095 192.168.1.113 -> 192.168.1.107 SMB Close Request, FID: 0x2757
2.014174 192.168.1.107 -> 192.168.1.113 SMB Close Response
2.014355 192.168.1.113 -> 192.168.1.107 SMB NT Create AndX Request, Path: \work\openjfx-compiler\classes\production\openjfx-compiler\com
2.014504 192.168.1.107 -> 192.168.1.113 SMB NT Create AndX Response, FID: 0x2758
2.014962 192.168.1.113 -> 192.168.1.107 SMB Trans2 Request, FIND_FIRST2, Pattern: \work\openjfx-compiler\classes\production\openjfx-compiler\com\sun
2.015169 192.168.1.107 -> 192.168.1.113 SMB Trans2 Response, FIND_FIRST2, Files: sun
2.015339 192.168.1.113 -> 192.168.1.107 SMB Close Request, FID: 0x2758
2.015428 192.168.1.107 -> 192.168.1.113 SMB Close Response
2.015633 192.168.1.113 -> 192.168.1.107 SMB NT Create AndX Request, Path: \work\openjfx-compiler\classes\production\openjfx-compiler\com\sun
2.015764 192.168.1.107 -> 192.168.1.113 SMB NT Create AndX Response, FID: 0x2759
2.015980 192.168.1.113 -> 192.168.1.107 SMB Trans2 Request, FIND_FIRST2, Pattern: \work\openjfx-compiler\classes\production\openjfx-compiler\com\sun\javafx
2.016221 192.168.1.107 -> 192.168.1.113 SMB Trans2 Response, FIND_FIRST2, Files: javafx
2.016402 192.168.1.113 -> 192.168.1.107 SMB Close Request, FID: 0x2759
2.016493 192.168.1.107 -> 192.168.1.113 SMB Close Response
2.016693 192.168.1.113 -> 192.168.1.107 SMB NT Create AndX Request, Path: \work\openjfx-compiler\classes\production\openjfx-compiler\com\sun\javafx
2.016827 192.168.1.107 -> 192.168.1.113 SMB NT Create AndX Response, FID: 0x275a
2.017096 192.168.1.113 -> 192.168.1.107 SMB Trans2 Request, FIND_FIRST2, Pattern: \work\openjfx-compiler\classes\production\openjfx-compiler\com\sun\javafx\api
2.017348 192.168.1.107 -> 192.168.1.113 SMB Trans2 Response, FIND_FIRST2, Files: api
2.017520 192.168.1.113 -> 192.168.1.107 SMB Close Request, FID: 0x275a
2.017590 192.168.1.107 -> 192.168.1.113 SMB Close Response
2.017803 192.168.1.113 -> 192.168.1.107 SMB NT Create AndX Request, Path: \work\openjfx-compiler\classes\production\openjfx-compiler\com\sun\javafx\api
2.017919 192.168.1.107 -> 192.168.1.113 SMB NT Create AndX Response, FID: 0x275b
2.018133 192.168.1.113 -> 192.168.1.107 SMB Trans2 Request, FIND_FIRST2, Pattern: \work\openjfx-compiler\classes\production\openjfx-compiler\com\sun\javafx\api\tree
2.018389 192.168.1.107 -> 192.168.1.113 SMB Trans2 Response, FIND_FIRST2, Files: tree
2.018547 192.168.1.113 -> 192.168.1.107 SMB Close Request, FID: 0x275b
2.018626 192.168.1.107 -> 192.168.1.113 SMB Close Response
2.018779 192.168.1.113 -> 192.168.1.107 SMB Close Request, FID: 0x2754
2.018851 192.168.1.107 -> 192.168.1.113 SMB Close Response
2.019157 192.168.1.113 -> 192.168.1.107 SMB Trans2 Request, QUERY_PATH_INFO, Query File Basic Info, Path: \work\openjfx-compiler\classes\production\openjfx-compiler\com\sun\javafx\api\tree\OnDeleteElementTree.class
2.019292 192.168.1.107 -> 192.168.1.113 SMB Trans2 Response, QUERY_PATH_INFO
2.019495 192.168.1.113 -> 192.168.1.107 SMB Trans2 Request, QUERY_PATH_INFO, Query File Standard Info, Path: \work\openjfx-compiler\classes\production\openjfx-compiler\com\sun\javafx\api\tree\OnDeleteElementTree.class
2.019613 192.168.1.107 -> 192.168.1.113 SMB Trans2 Response, QUERY_PATH_INFO
2.019832 192.168.1.113 -> 192.168.1.107 SMB Trans2 Request, QUERY_PATH_INFO, Query File Internal Info, Path: \work\openjfx-compiler\classes\production\openjfx-compiler\com\sun\javafx\api\tree\OnDeleteElementTree.class
2.019960 192.168.1.107 -> 192.168.1.113 SMB Trans2 Response, QUERY_PATH_INFO
2.020206 192.168.1.113 -> 192.168.1.107 SMB Trans2 Request, QUERY_PATH_INFO, Query File Network Open Info, Path: \work\openjfx-compiler\classes\production\openjfx-compiler\com\sun\javafx\api\tree\OnDeleteElementTree.class
2.020316 192.168.1.107 -> 192.168.1.113 SMB Trans2 Response, QUERY_PATH_INFO


For those of you who don't enjoy reading packet dumps (I hope that's all of you), what's going on here is that it does some sort of complicated multipacket transaction for each level of directory from the project root down to the last directory in the chain, and then does four request-responses for each file. And it repeats the directory stuff for every file, even though it just asked that.

OK, more debugging -- what could cause a system to deviate from the standard file system client behavior? Check all the network control panel settings, they're all the same. Spend several hours googling through the MS knowledge base for file sharing related problems, look at the various registry keys and file versions mentioned, nope, none of them are helpful. Google for people who have had similar problems. Many have, but no one reported a solution that works, except one person, who mentioned that their network behavior changed when they changed versions of Symantec antivirus. Well, I don't run Symantec AV, but I do run ZoneAlarm. And I do have different versions -- ZA Antivirus on the old system, ZA Suite on the new. Seems like a small difference -- they're clearly built on the same base technology -- but lets try it. Disabled ZA, reboot, make sure its not running, and run my IDE again -- no change. Still an annoying 30-60s delay before it figures out there's nothing to rebuild.

At this point, I asked my friends for help. Lots of sympathy. Lots of "check this, check that", but very little advice that actually moved me towards a solution (sorry, guys).

That post about the guy with the Symantec problem gnawed at me, though. I know security programs intercept a lot of network traffic, so the theory was perfectly plausible, and the best theory I had so far. I did the "disable ZA" thing again, rebooted, and cranked up Rootkit Hook Analyzer to see if ZA still had anything hooked, and it did, even though there were no ZA processes running and the ZA TrueVector service was stopped. So, I ran the uninstaller for ZA Suite, rebooted, checked with RHA to see that everything was unhooked (it was), and ran my IDE test again -- and this time, sweet success!

So, the conclusion is that the ZA Suite interferes with file sharing client behavior in a rather fundamental way (but one which only has a noticeable affect when dealing with lots of small files).

So, my system is temporarily defenseless against malware while I decide what to do. Why on earth would ZA rewrite the file system client packet stream like that? I want to send them a bill for that day and a half.

14 comments:

  1. You should install Cygwin on your Windows machine. This gives you the power of a Linux commandline, but running on Windows. It should allow you to forget about the share alltogether.

    ReplyDelete
  2. Update: I tried backing off to ZA AV and then to the ZA standalone (free) firewall on the new machine -- and each time, the same problem came up, but only after downloading the latest update. I'm guessing that updating ZA on the old machine would cause the same problem. So I'm back to Windows firewall right now.

    I spent some time researching AV/firewall software. Seems like people hate them all.

    ReplyDelete
  3. Several people have commented that I was pretty lucky that the uninstaller actually uninstalled everything.

    ReplyDelete
  4. Re Cygwin: I've actually got cygwin installed. But I've always been too aware of the ways in which its not a real Linux system.

    ReplyDelete
  5. Those things tend to become more and more widespread. Some colleagues experience similar slowdowns because of ZA which is part of our VPN infrastructure. Even when on the local network and everything is set to "pass-thru" their notebooks tend to be significantly slower than the desktop PCs that do not have ZA installed.
    Combine that with Symantec AV and you just need the new machine to keep up with the increased load the security tools pose ;)

    Other products are not necessarily better. I had a case with Kerio Personal Firewall not uninstalling properly - leading to the system freezing upon upgrading the (F-Secure) Antivirus - of course both companies blame the other when you ask for help...

    ReplyDelete
  6. @ds: Agree that other products are not necessarily better. I've had good luck with ZA in the past, after having tried a number of them. But the competitive forces are pushing these companies to add more and more features to their products, and they become larger and more intrusive. The irony is that many users give up on using any security software because they're all too heavyweight.

    ReplyDelete
  7. Snicker. (From a Mac fanboy.)

    Seriously, one of the best ways to run Windows is in a virtual environment on a Mac (or any other system for that matter). That way, if the thing crashes, or you need to reinstall, or whatever, you can let it do its thing in the background. It's also easier to manage multiple configurations, so that if you're afraid an update will break something, you can take a snapshot you can roll back to if something goes wrong.

    On another note, it does seem to be commonly held lore that Windows systems will inexorably slow down after a couple years of use. Has anyone ever figured out why this occurs and what can be done about it, short of a reinstall?

    ReplyDelete
  8. Actually, in a way, I got really lucky -- the broken update happened at exactly the time I got a new system, which allowed me to measure the old behavior and new behavior side-by-side. Without that, I might never have been able to gather enough data to find the problem.

    ReplyDelete
  9. Snicker. (From an Ubuntu Linux fanboy.)

    Seriously, Brian, what are your reasons to use Windows? You say "I like being able to do some things from the Linux command line and other things from the IDE". It's not like you can't have both on the same computer :-) Combination of CrossOver Office and Windows in a Virtual machine solved my Windows software needs. I see the advantages of moving to Linux (or Mac) - faster, simpler setup, no OS decay, you are already comfortable with Unix environment. What are the downsides for you?

    BTW, in my experience build on Linux is much faster than on Windows (http://blog.mousefeed.com/2007/12/faster-build-compilation-with-ram-disk.html).

    ReplyDelete
  10. Update: installed Kaspersky anti-virus, impact seems to be acceptably low.

    ReplyDelete
  11. The biggest downside for desktop linux is device support. Any weird devices (like nine-button trackballs) have much better device support on Windows. While its often possible to configure drivers to do similar things, it requires a lot of care and feeding.

    ReplyDelete
  12. Thank you for the quick response. Did not have to deal with the monster mice like the one you mentioned :-) If this is the main reason you are on Windows, you are in a pretty good shape. I'd expect the hardware support problems you mentioned to be resolved soon, certainly by the next time you need to upgrade your computer.

    ReplyDelete
  13. I have found that Zone Alarm has another strange behavior that it seems only to exhibit on Vista. The Tivo desktop software demonstrates the problem, but I wrote some code to test the situation with jmdns. It turns out that if your multicast application depends on receiving the packets locally as well as sending them to the network, ZoneAlarm blocks it. You can only fix the problem by uninstalling ZoneAlarm. Since you mentioned the Rootkit Hook Analyzer, I might try out testing it again to see if I can figure out if the problems are directly related.

    ReplyDelete
  14. Brian, excellent blog entry. I saw this same behavior and your blog pointed me at Symantec Corp AV as the culprit. Opened up a call with Symantec because we did not check 'scan network drive' but their check box is backwards in 10.0 vs. 10.1 and checking the box actually turned it on the 'do not scan' bass-ackwards!! I will try to get more info on exactly why Symantec does this and send it along. Thanks for the help, the application that it was breaking was taking a lot of heat.

    ReplyDelete