Accurate timing for the Rabbit 4000 series

I am currently writing the code for a 1-Wire interface for the Rabbit 4000 series, and I need accurate timing control at the microsecond level. Counting clock cycles is strangely inaccurate. I am running the processor at full speed, and the loop count I need for my delays ends up being much smaller than expected: 0x0B90 instead of the calculated 0x1001. Could the periodic interrupt be the cause? I wrote the code in assembly, so I am a bit confused by the discrepancy. Any ideas?

// drive output low
ld a,(PCDRShadow) //[9] load A with the shadow copy of the Port C data register
set 0,a //[4] set bit 0 (Drive) to enable the pull-down
ld (PCDRShadow),a //[10] put the result back in the shadow copy
ioi ld (PCDR),a //[2+9] write the result to the Port C data register itself

// delay for 480 us
ld bc,0x0b90 //[6] load the loop count to count down from
loop0001:
dwjnz loop0001 //[7] decrement BC and loop until it reaches zero

// pull output high
res 0,a //[4] clear bit 0 (Drive) to turn off the pull-down
ld (PCDRShadow),a //[10] put the result back in the shadow copy
ioi ld (PCDR),a //[2+9] write the result to the Port C data register itself

With 0x0B90 this delay measures 480 us, but by my cycle count I should have to preload BC with approx. 0x1001, and that value actually gives something closer to 650 us.
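
To put a number on the discrepancy, here is a rough sketch in plain C (meant for a PC, not Rabbit code). It only does arithmetic on the figures above. The helper names are made up for illustration, and it assumes the delay is proportional to the loop count and that the 7 clocks per dwjnz in my comments is the right nominal figure.

```c
#include <assert.h>
#include <stdint.h>

/* Clocks per dwjnz iteration, as annotated in the listing above. */
#define NOMINAL_CLOCKS_PER_ITER 7.0

/* Effective clocks per loop iteration, inferred from how far the working
 * count falls short of the calculated one. No clock frequency is needed,
 * because only the ratio of the two counts matters. */
double effective_clocks_per_iter(uint16_t calculated_count, uint16_t working_count)
{
    assert(working_count != 0);
    return NOMINAL_CLOCKS_PER_ITER * (double)calculated_count / (double)working_count;
}

/* Scale one measured (count, delay) pair to a new target delay.
 * Assumption: delay is proportional to count; fixed overhead is ignored. */
uint16_t calibrated_count(uint16_t measured_count, double measured_us, double target_us)
{
    double count;

    assert(measured_us > 0.0 && target_us > 0.0);
    count = (double)measured_count * target_us / measured_us + 0.5; /* round to nearest */
    assert(count >= 1.0 && count <= 65535.0); /* must fit in BC */
    return (uint16_t)count;
}
```

Feeding in my numbers, effective_clocks_per_iter(0x1001, 0x0B90) comes out at roughly 9.7, so the loop behaves as if each pass costs about 9.7 clocks rather than 7. As a stopgap, calibrated_count(0x0B90, 480.0, t) gives a count for some other delay t scaled from the one measurement I have, though that only holds if the slowdown is constant.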