Flutterby™! : Curious benchmarks


Curious benchmarks

2007-11-28 23:08:46.67622+00 by Dan Lyke 7 comments

Interesting. I'm playing with pthreads a bit, so I've got a simple loop that increments a variable about a billion times, once just flat out, once locking around the increment. Linux laptop, Intel Core Duo T2450 at 2.00GHz, 3.6 seconds without locks, 58.8 seconds with locks. Mac Pro, 2 Dual-Core Intel Xeons at 2.66GHz, 2.7 seconds without locks, 62.1 seconds with locks.

Stupid benchmark, not significant in any way, except that the Linux laptop often feels way snappier than the Mac desktop (and the Mac laptop that's in the shop right now). If the overhead of the OS and system library implementations is chewing up that additional CPU speed, that may explain a lot...

[ related topics: Open Source Macintosh ]

comments in ascending chronological order (reverse):

#Comment Re: made: 2007-11-28 23:49:53.445403+00 by: Jim S

One significant hardware difference: I suspect the dual processor machine has to get the lock written out to main memory while the T2450 only has to get it as far as the level 2 cache... or the linux pthread lock is just faster.

#Comment Re: made: 2007-11-29 00:07:13.57475+00 by: Dan Lyke [edit history]

I'd expect that there'd be no particular need for the processor local cache to get flushed, so I chalked it up to the Linux pthread lock taking roughly 8 units to OS X's 11 units (where a unit is one iteration of the while (abc < 999999999) abc++; loop).

Edit: Huh, duh, I was just being obtuse there, I need to go look at how multi-processor Xeon boxes communicate internal cache dirty status to each other.

#Comment Re: made: 2007-11-29 06:36:02.204736+00 by: spc476

Hmmm ... the locking seems to really take a toll. I did the following:

loop:           mov     eax,[gv]
                inc     eax
                mov     [gv],eax
                cmp     eax,1000000000
                jl      loop

And got 2.454s on a 2.6GHz dual Pentium system (running the code on a single core). I then did a spinlock version:

loop:           mov     al,1
spin:           xchg    al,[glock]
                or      al,al
                jne     spin
                mov     eax,[gv]
                inc     eax
                mov     [gv],eax
                mov     byte [glock],0
                cmp     eax,1000000000
                jl      loop

on the same system, and with one core running this segment, had a runtime of 39.752s. Even simple spin locks are expensive.

#Comment Re: made: 2007-11-29 07:53:50.425362+00 by: spc476

I reran the test, this time dual-threaded (dual-core Pentium). I got some ... um ... curious results.

#Comment Re: made: 2007-11-29 16:10:21.597591+00 by: Dan Lyke

If you feel like going further with that, how about getting rid of the 8-bit code in the locks and using full 16- or 32-bit words? The sequence:

t2:             mov	al,1
t2.wait:        xchg	al,[glock]
                or      al,al
                jne     t2.wait

just screams pipeline stalls while the processor is splitting that poor register apart. I'll bet using ax or eax will give you at least 2x.

#Comment Re: made: 2007-11-29 19:29:57.831987+00 by: spc476

16-bit would be worse (operand-size override prefixes on the x86), and going full 32-bit didn't help.

#Comment Re: made: 2007-11-29 19:36:25.563757+00 by: Dan Lyke

Huh. Weird. Thanks for the update.