It depends on how your game is programmed, but typically it will be no more than one frame (1/(60 Hz) ≈ 16.7 milliseconds).
In more detail: when you read the controllers, you get the exact state of the buttons at that moment, and when you write to the sound registers, the writes take effect immediately. So the latency depends on how long the delay is between reading the controller and processing the sound, both of which usually happen within a single frame unless the game is lagging.
That's the typical case. If you want, you can process sound right after reading the controller, in which case the delay could be as little as a few tens or hundreds of CPU cycles (one CPU cycle is about 559 nanoseconds on an NTSC NES).
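To put those figures side by side, here is a rough back-of-the-envelope sketch in Python. The constants are my own additions, not something from the question: it assumes the NTSC master clock of 21.477272 MHz divided by 12 for the CPU, and about 29780.5 CPU cycles per video frame. The function names are just made up for illustration.

```python
# Rough NTSC NES timing arithmetic (assumed constants, see the note above).
MASTER_CLOCK_HZ = 21_477_272          # assumed NTSC master clock
CPU_CLOCK_HZ = MASTER_CLOCK_HZ / 12   # CPU clock, roughly 1.79 MHz
CYCLES_PER_FRAME = 29780.5            # assumed CPU cycles per NTSC frame


def cycles_to_microseconds(cycles):
    """Convert a number of CPU cycles to microseconds."""
    return cycles / CPU_CLOCK_HZ * 1e6


def frame_time_ms():
    """Duration of one video frame in milliseconds."""
    return cycles_to_microseconds(CYCLES_PER_FRAME) / 1000


if __name__ == "__main__":
    print(f"one CPU cycle: {cycles_to_microseconds(1) * 1000:.1f} ns")
    # Delay if sound is processed right after the controller read.
    for cycles in (50, 500):
        print(f"{cycles} cycles: {cycles_to_microseconds(cycles):.1f} us")
    # Worst case in the typical setup: a full frame between read and write.
    print(f"one frame: {frame_time_ms():.2f} ms")
```

So even the worst typical case is a bit under 17 milliseconds, and the best case is well under a millisecond.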
In other words: it's really not something you have to worry about.