Hyperthreading notwithstanding, one thread per CPU will give you the highest application performance. Having more than that means context switching and contention, which means the CPU is spending time on tasks other than running your program.
For this reason, the FastCGI approach to web application servers is a sound one: prefork a fixed number of single-threaded processes to handle requests, one per CPU.
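To make the prefork idea concrete, here is a minimal sketch in Python. It is not how any particular FastCGI server is implemented. It only illustrates the shape: a fixed pool of single-threaded worker processes, one per CPU by default, pulling requests off a shared queue. The names `handle_request`, `worker` and `serve` are hypothetical, the handler is a stand-in for real application work, and the sketch assumes a Unix-like system where `fork()` is available.

```python
import multiprocessing as mp
import os


def handle_request(request: str) -> str:
    """Hypothetical handler: stands in for real application work."""
    return request.upper()


def worker(inbox, outbox) -> None:
    """Single-threaded loop: handle one request at a time until told to stop."""
    for request in iter(inbox.get, None):  # None is the shutdown sentinel
        outbox.put((request, handle_request(request)))


def serve(requests, num_workers=None):
    """Prefork a fixed pool of workers, feed them requests, collect replies."""
    ctx = mp.get_context("fork")  # assumption: Unix-like OS with fork()
    num_workers = num_workers or os.cpu_count() or 1  # default: one per CPU
    inbox, outbox = ctx.Queue(), ctx.Queue()

    # Start all workers up front, before any request arrives.
    workers = [
        ctx.Process(target=worker, args=(inbox, outbox))
        for _ in range(num_workers)
    ]
    for w in workers:
        w.start()

    for request in requests:
        inbox.put(request)
    for _ in workers:
        inbox.put(None)  # one sentinel per worker

    # Collect one reply per request, then wait for the workers to exit.
    replies = dict(outbox.get() for _ in requests)
    for w in workers:
        w.join()
    return replies


if __name__ == "__main__":
    print(serve(["alpha", "beta", "gamma"]))
```

The number of workers is fixed at startup and never grows with load, which is the point: no more runnable threads than CPUs.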
So why spawn more threads or processes than CPUs? Blocking. If a thread blocks, it's out of action until the thing it's waiting for is ready. With one thread per CPU, if a thread ever blocks, that's a whole CPU idle. Implementing an application that never blocks, especially when it needs to make network calls (e.g. to a database) or read files, is more challenging than doing the same thing with blocking calls and many threads.
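As a rough illustration of what "never blocks" buys you, the sketch below runs several simulated database calls on a single thread using Python's `asyncio`. The function `fake_db_query` is hypothetical and uses `asyncio.sleep` in place of a real network round trip. While one call is waiting, the thread moves on to the next instead of sitting idle, so the total time is close to the latency of one call rather than the sum of all of them.

```python
import asyncio
import time


async def fake_db_query(query_id: int, latency: float = 0.1) -> int:
    """Hypothetical query: the sleep stands in for a network round trip."""
    await asyncio.sleep(latency)  # yields the thread instead of blocking it
    return query_id * 2


async def handle_all(n: int) -> list:
    """Start n queries at once and wait for all of them on one thread."""
    return await asyncio.gather(*(fake_db_query(i) for i in range(n)))


def timed_run(n: int):
    """Run n overlapping queries and report the results and elapsed time."""
    start = time.perf_counter()
    results = asyncio.run(handle_all(n))
    return results, time.perf_counter() - start


if __name__ == "__main__":
    results, elapsed = timed_run(5)
    print(results, f"{elapsed:.2f}s")
```

The catch, as noted above, is that every call in the request path has to cooperate. A single ordinary blocking call, such as `time.sleep` or a synchronous database driver, stalls the whole thread and everything queued behind it.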