There was a translator present - just not Translator-san (who I'm told has broken his ankle - get well soon Translator-san!) - and Yamauchi rarely speaks in English at all, let alone full interviews.
However, invoking "lost in translation", I kind of understand what he's getting at, even if I don't necessarily agree.
"Sound" is a complex field - it's not as simple as recording a noise and playing it back. A really basic example is how you sound differently on a tape recorder than you do in your head. While I'm sure that how they record sounds isn't necessarily how I'd do it (doing it on a dyno is sensible - loading the engine with no road noise), the simple fact is that you will not be able to tell the difference on an equaliser between the real car and the GT5 one. It'll be the same pitch (frequency) spectrum at, if you choose, the same volume. This is what I suspect Kazunori means by "too real" - 1:1 on the equaliser.
The perplexing thing is that some cars were recorded on a dyno, others on track, and others just parked up somewhere. But they all have the same drawbacks, consistent with the lowest-common-denominator: parked up.
This isn't the problem with GT5's sounds. The problem is what he refers to as "sexier" sounds - or what musicians will know as timbre. If you play two musical instruments at the same pitch and same volume an equaliser will show no difference - but they're different, aren't they? You know how you can tell between a synthesiser version of an instrument and a real instrument - or a human voice and autotune? This is due to timbre - timbre is what gives "sexiness" to sound. You can even tell between two identical instruments played entirely in synch with each other due to timbre...
What constitutes timbre is tough to pin down - it's essentially every characteristic of a sound that isn't the frequency or volume. It's often referred to as "sound colour" and you'll hear terms bandied about like ADSR (attack [the start of a note], decay [normalisation from the attack to the sustain], sustain [the intended note], release [return to zero]), but it's really tough to explain and even tougher to compress and shove onto a CD/DVD/BD, uncompress, dynamically simulate and allow space for a game.
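For anyone who hasn't met ADSR before, here's a rough Python sketch of a volume-only envelope, using straight-line segments. The function name and all the default timings are made up for illustration, and it assumes the note is held for longer than the attack and decay combined.

```python
def adsr_level(t, attack=0.05, decay=0.10, sustain=0.7, release=0.20,
               note_length=1.0):
    """Envelope level (0..1) at time t seconds for a note held note_length seconds.

    All defaults are arbitrary illustration values, not taken from any game.
    """
    if t < 0:
        return 0.0
    if t < attack:
        # Attack: ramp from silence up to the peak.
        return t / attack
    if t < attack + decay:
        # Decay: settle from the peak down to the sustain level.
        return 1.0 - (1.0 - sustain) * (t - attack) / decay
    if t < note_length:
        # Sustain: hold the intended level while the note is held.
        return sustain
    if t < note_length + release:
        # Release: fall from the sustain level back to zero.
        return sustain * (1.0 - (t - note_length) / release)
    return 0.0
```

Multiplying each sample of a steady tone by `adsr_level(n / sample_rate)` shapes its volume over time. The same shape could just as well be applied to pitch or filter settings rather than volume.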
Timbre or "colour" is the spectral signature of the sound, and as such is dependent on pitch and volume, and also forms part of how the properties of the sound change over time. You can think of colour as being an analogy of "key colour" applied to musical instruments, and as such is an entirely emotional concept in origin. Nowadays, sound is usually split into several other components that make discussion and objective measuring (and thus reproduction) a bit easier: pitch, harmonics, formants and volume(s); and these are considered both statically and "dynamically" - that's not to be confused with "dynamics" in the musical sense, i.e. changes in volume alone, but rather how they each change with time.
Reproducing the static timbre is easy, as is the pitch; the samples do that already. What is hard, as you rightly say, are the dynamic changes in these things - any discussion on vocal synthesis will quickly tell you that much. Having tried synthesising engine sounds myself, I can tell you that you can use several different methods to get perfectly accurate (but static) engine sounds, but it's the way they change according to the inputs that is the real challenge. The ADSR stuff, whilst really an artificial construct useful in synthetic instruments, is primarily a dynamic consideration, and can be (and is) used for all of the properties of sound: fundamental pitch, harmonic structure, formants, volume etc. This sort of variation is impossible to produce a sample set for, as you say.
What many games do to substitute for timbre is make it louder and add more bass - because we associate noise, particularly bassy noise, with feeling. If you can feel the noise move through you it feels more "real" (and at real race tracks, sound hits you like a wall). Shoved through a set of TV speakers, it sounds "better" than the quieter and more accurate (in terms of the equaliser) note. GT games don't do this (with the exception of GTHD) and so, through TV speakers, they sound like ass because there is neither real feeling nor fake feeling - just the frequencies and volume. They sound better if you have speakers with better range and quicker reactions or if you have a good amp to dig the sounds out (on the pair of monitor speakers I usually use for gaming, GT5 sounds fine, if a little vague sometimes. Good enough at least that my wife can hear I'm driving a V8, three walls and a floor away) but the lack of timbre or a substitute for it prevents the realism.
To me, this is nothing to do with timbre, at least not the measurable type. It is purely psychoacoustics. Loud sounds are felt as well as heard, and if you don't have that, it won't feel right, absolutely. Additionally, however, games neglect spatial colouration on sounds massively, because it's a spectral thing - different frequencies are affected differently.
If you consider the problem of global illumination and then realise that the length scales involved with sounds mean you need to do that sort of thing for every frequency (or some constrained set of frequency bands) you can begin to see the scale of this particular issue. Nevertheless, adding convincing spatialisation (due to both source and environment, both of which are static in recordings) is key to making the sounds more real, much more so than fake distortion and bass-boosting. The updated external sounds on iRacing's V8 Supercars are proof of that. All that changed was the samples include a bit of comb filtering naturally present from a distant recording, and it really makes the sound that much more real (compare the original sounds). I personally think these effects should be added dynamically, but I haven't personally found a reliable way to do that yet (but I know it's possible).
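To show what comb filtering does to a signal, here's a minimal Python sketch. It is not how iRacing or PD do anything; it assumes the simplest possible case of the direct sound plus one equally loud reflection arriving 1 ms later (about 34 cm of extra path). `comb_filter`, `sine`, `SAMPLE_RATE` and `DELAY` are names invented for the example.

```python
import math

SAMPLE_RATE = 48000   # samples per second (assumed)
DELAY = 48            # reflection delay in samples: 48 / 48000 = 1 ms (assumed)


def comb_filter(samples, delay, gain=1.0):
    """Feedforward comb: each output is the input plus a delayed copy of it."""
    out = []
    for n, x in enumerate(samples):
        delayed = samples[n - delay] if n >= delay else 0.0
        out.append(x + gain * delayed)
    return out


def sine(freq, count=4800):
    """A unit-amplitude sine wave, count samples long."""
    return [math.sin(2 * math.pi * freq * n / SAMPLE_RATE) for n in range(count)]


def peak(samples):
    return max(abs(s) for s in samples)


# At 500 Hz the 1 ms delay is half a period, so the copy cancels the original.
notch_peak = peak(comb_filter(sine(500.0), DELAY)[DELAY:])
# At 1000 Hz the delay is a whole period, so the copy reinforces the original.
boost_peak = peak(comb_filter(sine(1000.0), DELAY)[DELAY:])

print(f"500 Hz peak after comb:  {notch_peak:.6f}")
print(f"1000 Hz peak after comb: {boost_peak:.6f}")
```

The notches and peaks repeat all the way up the spectrum, which is where the "comb" name comes from. Changing `DELAY` with the listener's position would move them around, which is one way of reading "added dynamically".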
Add to that better simulation of the way engines actually work, and you immediately get better control of the dynamic aspects of the engine (i.e. "ADSR") - again, see iRacing, which recently added drivetrain flex to its simulation and got "gear wobble" etc. effects for free. The improvement (compare) is spectacular.
There's certainly more they could do. Sound recording needs to be primarily in the driver's seat for cockpit, in the engine for nosecam, two feet behind and four feet above the car for coptercam (though winding in some essence of the other two for each will help add character). It needs to be pushed through a spectrometer rather than an equaliser before being passed as satisfactory. It needs to be optimised for different settings - the ghacky 2W TV speakers most people play through, a stereo system, a basic surround system (2.1), a middle surround system (5.1), a geeky surround system (7.1) and a full cinema system (what's this up to now? 14.2?).
Or they could make it louder and add bass.
Sound recording for making of samples needs to be as clean as possible, capturing all the sources externally (so, intake, exhaust etc.) and a single set of directional recordings from the interior. They then need to collect recordings, preferably with video relating the position and angle of the car relative to the listener, of the car being used, preferably in an open area, to get an idea of how those clean sources translate into the spatialised sounds we need to be hearing.
That's the source aspect of spatialisation, which varies from car to car based on its shape and size, and on the placement, shape and size of its sources, etc. It should be reproduced on the fly by colouring the clean recordings appropriately. The environment aspect is easily approximated using reverbs and proper directional source mixing, although there's some complexity with reverb and direct path volume scaling (wet / dry ratio), as well as directional reverbs.
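As a rough illustration of the wet / dry point, here's a small Python sketch. It assumes the textbook simplification that the direct path falls off as 1/distance while the reverberant level stays roughly constant. The function names and the `reverb_gain` value of 0.1 are invented, and a real environment would be messier.

```python
def direct_gain(distance_m, reference_m=1.0):
    """Direct-path gain: inverse-distance falloff, clamped inside the reference distance."""
    return reference_m / max(distance_m, reference_m)


def wet_dry_ratio(distance_m, reverb_gain=0.1, reference_m=1.0):
    """Reverb (wet) level relative to direct (dry) level at a given distance.

    reverb_gain is an arbitrary constant standing in for the room's
    reverberant level, assumed not to change with distance.
    """
    return reverb_gain / direct_gain(distance_m, reference_m)


for distance in (1.0, 5.0, 10.0, 50.0):
    print(f"{distance:5.1f} m  dry {direct_gain(distance):.3f}  "
          f"wet/dry {wet_dry_ratio(distance):.2f}")
```

Under those assumptions the mix gets wetter the further away the car is, so a single fixed reverb send won't sound right at every distance.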
The hardware for a spectrometer and an equaliser is very similar - in the former, you're only interested in measuring the relative "volumes" of the frequency ranges, whereas the EQ is designed to scale them relatively. As such, two musical instruments do look different on an EQ (assuming it has a pre-vis), in exactly the same way they look different on a spectrometer - their different timbres are apparent in their spectra. What is different is that a spectrometer tends to have more frequency bands, but that's not a defining feature in general, since you can get some very finely divided EQs on studio hardware and especially software.
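Here's a small Python sketch of that point, using two made-up tones rather than real instruments: same fundamental, same RMS level, different harmonic mix. The harmonic amplitudes and every name here are invented for illustration, and the "analyser" just measures the level at each harmonic over a whole number of periods.

```python
import math

SAMPLE_RATE = 8000    # samples per second (assumed)
FUNDAMENTAL = 100.0   # Hz, shared by both tones
N = 800               # exactly 10 periods of the fundamental


def rms(samples):
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def tone(harmonic_amps):
    """Sum of harmonics of FUNDAMENTAL, normalised to an RMS level of 1."""
    raw = [
        sum(a * math.sin(2 * math.pi * FUNDAMENTAL * (k + 1) * n / SAMPLE_RATE)
            for k, a in enumerate(harmonic_amps))
        for n in range(N)
    ]
    level = rms(raw)
    return [s / level for s in raw]


def band_level(samples, freq):
    """Amplitude of the component at freq (a single-bin Fourier measurement)."""
    re = sum(s * math.cos(2 * math.pi * freq * n / SAMPLE_RATE)
             for n, s in enumerate(samples))
    im = sum(s * math.sin(2 * math.pi * freq * n / SAMPLE_RATE)
             for n, s in enumerate(samples))
    return 2 * math.hypot(re, im) / len(samples)


def spectrum(samples, harmonics=4):
    return [round(band_level(samples, FUNDAMENTAL * (k + 1)), 3)
            for k in range(harmonics)]


mellow = tone([1.0, 0.2, 0.05, 0.0])   # mostly fundamental
bright = tone([1.0, 0.8, 0.6, 0.4])    # strong upper harmonics

print("mellow:", spectrum(mellow))
print("bright:", spectrum(bright))
```

Both tones have the same pitch and the same overall level, but the per-band readings differ, which is exactly what a display with enough bands would show.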
The sounds do need to have different mixing for different hardware - GT5 suffers in that it has realistic dynamic ranges, coupled with a dynamic range (volume) compressor, that really doesn't work on hardware with poor dynamic range. What it needs is a dynamic source-volume solution to allow the dynamic range to fit in a given "window", rather than cap the volume, which just means quieter concurrent sounds get drowned out.
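One possible reading of the "window" idea, as a hedged Python sketch: take the levels of the concurrent sources in dB, and if their spread is wider than the hardware can manage, squeeze the spread proportionally so the quietest source still lands inside the window. The function name, the example levels and the 30 dB window are all assumptions for illustration, not anything PD has described.

```python
def fit_to_window(levels_db, window_db, ceiling_db=0.0):
    """Rescale source levels (in dB) so their spread fits within window_db.

    The loudest source is pinned to ceiling_db. If the spread already fits,
    relative levels are left untouched.
    """
    loudest = max(levels_db)
    span = loudest - min(levels_db)
    scale = 1.0 if span <= window_db else window_db / span
    return [ceiling_db - (loudest - level) * scale for level in levels_db]


# Hypothetical concurrent sources: tyre scrub, wind noise, engine.
sources_db = [-60.0, -30.0, 0.0]

# Squeezed into a 30 dB window for small TV speakers (assumed figure).
print(fit_to_window(sources_db, window_db=30.0))
```

The quiet sources come up relative to the loud ones instead of everything being clipped at the top, so they stay audible on hardware with poor dynamic range.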
Take "real" as "faithful frequency and amplitude reproduction" and take "sexier" as "better timbre or bassier/noisier" and the response makes sense.
As I've already stated, I think he's talking about the sterility, much as he was with the environment visuals: how everything looked "perfect", because it was captured at one time of day and everything was so clean and unchanging. The sounds are similar: they are clean (close, isolated recordings) and totally static in terms of spatialisation, except for a bit of clever noise generation and filtering on the exhaust sounds - a clue to PD's intended direction.
Since the solution to the sterile visuals was to include dirt in all the right places, and extra detail, too, plus to add a dynamic aspect to them in the form of weather and day / night transitions (here and there...), I suspect the same is going to happen to the sounds. That is, they'll have imperfections (colouration) dynamically applied to them according to the source conditions and the environment.
Dynamic spatialisation is sexy.