endasc2, your focus is correct. Noise is the overriding issue. You might want to check out Amarra and try to get as much uncompressed music as possible (ideally with a higher bit depth and sampling rate). My additional recommendation is to get the best tube amp you can afford, great wires, and, after doing your research, the best tubes possible. Match that to the best headphones you see here at Massdrop (avoid the boom factories - go for accurate). That will give you incredibly good sound that is within reach.
Thirty years ago, the most popular audio magazine was Stereo Review. Their lead reviewer, Julian Hirsch, constantly argued that you couldn't hear the difference between amplifiers with distortion below a certain point. This claim came from a man whose hearing had failed. I'm convinced that people who claim you can't hear the difference between music sampled at higher and lower sampling rates and bit depths are either relying on what they read (not listening), don't have a trained ear (e.g. aren't a musician or sound engineer), don't know what to listen for, haven't tried, or simply can't hear (as with Julian Hirsch).
Scientists often quote the Nyquist-Shannon sampling theorem, which establishes that in order to reproduce a frequency, you need a sampling rate at least twice that frequency. So, if you want to reproduce 20,000 Hz, you need a sampling rate of 40kHz or greater (20kHz being the upper end of what humans can hear - at birth). So, CDs sample at 44.1kHz. This is correct, but it is also a very limited perspective (both in terms of digital audio engineering and in terms of sound). With digital audio, there is something called sideband noise, which extends roughly 20kHz above and below the sampling frequency. One problem with 44.1kHz sampling is that this sideband noise sits so close to the music (half a note away) that it requires brick-wall filters or oversampling to remove. Sampling at 96kHz avoids that sideband noise issue completely, because it is pushed far above the audible band.
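To make the Nyquist point concrete, here's a small sketch (assuming NumPy is available) of what happens when a tone above half the sampling rate is captured: it folds back ("aliases") into the audible band. The 30kHz test tone and the find-the-peak-with-an-FFT approach are my own illustrative choices, not anything from the discussion above.

```python
import numpy as np

fs = 44_100             # CD sampling rate, in Hz
f_tone = 30_000         # a tone above the Nyquist limit (fs/2 = 22,050 Hz)
t = np.arange(fs) / fs  # one second of sample times
x = np.sin(2 * np.pi * f_tone * t)

# Find the dominant frequency actually present in the sampled signal.
spectrum = np.abs(np.fft.rfft(x))
f_detected = np.argmax(spectrum) * fs / len(x)

# The 30 kHz tone folds back to fs - f_tone = 14,100 Hz.
print(f_detected)  # ~14100.0
```

This is why an anti-aliasing filter has to remove everything above fs/2 before sampling, and why 44.1kHz leaves so little room between the top of the music (20kHz) and the Nyquist limit (22.05kHz) for that filter to work.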
When CDs first came out in the 1980s, Stereo Review declared that CDs meant we would now have "Perfect Sound Forever." After all, the numbers don't lie - or do they? On CDs, distortion was vanishingly low, dynamic range was great, and the discs didn't wear out. At the time, audiophiles were saying that while incredibly clean, CDs sounded terrible (as much an exaggeration as "Perfect Sound Forever"). What audiophiles were hearing (yes, HEARING) were the artifacts of far-from-perfect digital audio. The problems included a harsh upper register, loss of soundstage, an inability to resolve differences in timbre, and a significant problem with listener fatigue. The problem was both on the production side and on the discs themselves. So, this issue of what we can and cannot hear has been with us since the start of digital audio, but numbers are not ears. What is often overlooked is that we are only beginning to understand the brain, what the senses can perceive, and how.
[As a sidebar, I recall attending a meeting at Sony in the mid-1990s where their engineers claimed that their audio compression could losslessly resolve 16-bit audio in one-fifth the space of a CD. In five consecutive double-blind tests, I identified the compressed and uncompressed samples every time. They asked me how I was doing that. Compression is a different topic, but it makes the point.]
At a very basic level, you need to think about what 16/44.1 or 24/96 means. What is a sample? A sample is like a "picture" of the audio waveform at an instant in time. A 16-bit sample can resolve 2 to the 16th power, or 65,536, discrete values per sample, while a 24-bit sample can resolve 2 to the 24th power, or about 16.78 million, discrete values per sample. Simply put, the higher the bit depth, the more information can be stored and the finer the detail that can be resolved. A sampling rate of 44.1kHz means that there are 44,100 samples per second. That sounds like a lot, but it is not (it's barely enough). Likewise, a sampling rate of 96kHz means 96,000 samples per second (the higher bit depth also means more dynamic range, though we really can't use all of it - 144 dB would make us deaf). The key thing with a 96kHz sampling rate is that there is less than half the time between samples, bringing the sound closer to a continuum (closer to analog), with greatly reduced stair-stepping in the raw analog output, thereby reducing or eliminating the need to filter out that noise. So, what you are comparing is a system that resolves 65,536 discrete values per sample, 44,100 times a second, against a system that resolves 16.78 million discrete values per sample, 96,000 times a second. The latter is better.
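The arithmetic in this paragraph is easy to check yourself. A minimal sketch in plain Python, computing the level counts and the theoretical dynamic range from the standard 20·log10(2^bits) rule of thumb (roughly 6.02 dB per bit):

```python
import math

# Discrete levels per sample and theoretical dynamic range for each bit depth.
# Rule of thumb: each bit of depth adds about 6.02 dB of dynamic range.
for bits in (16, 24):
    levels = 2 ** bits
    dynamic_range_db = 20 * math.log10(levels)
    print(f"{bits}-bit: {levels:,} levels, ~{dynamic_range_db:.1f} dB")
```

This prints about 96.3 dB for 16-bit and about 144.5 dB for 24-bit, which is where the "144 dB" figure above comes from.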
Can you hear the difference? I don't know. Can you hear differences in the decay of instruments, or recognize the accuracy of their timbre, or the soundstage (side-to-side, front-to-back, and top-to-bottom), or the placement of instruments in that space, or the sound of the venue from direct, reflected, and reverberant sound? With more processed music, can you hear voice layering, limiters, gates, pitch correctors, and the various other effects processors? These are some of the things that are harder to resolve and more easily masked by noise or lost in encoding. They are also the things that I can hear on high-bit-rate, high-resolution systems, in a low-noise environment (especially in an anechoic chamber), played back on a system with sufficient capability. The nice thing about headphones is that for less than $1,000 you can get sound quality that would take upwards of $20k to match in a high-end stereo system. However, there is one big thing missing: headphones don't reproduce a massive soundstage. Then again, the percentage of people who can afford, or are willing to spend, north of $20k (or much more) on a two-speaker audio system is vanishingly small.
When thinking about audio - what you can perceive and what you can't - I believe it helps to think about video. For some reason, people are much more confident in their ability to see. Can you tell the difference between standard definition and HD, or between HD and UHD? People will tell you that you can't tell the difference between HD and UHD on a display that is 50 inches or smaller (and of course that is affected by the distance to the screen). Usually, this is a myopic argument that focuses too narrowly on resolution as a measure of pixels.
Pixels are only one aspect of resolution, and probably the least important. What about color gamut? What about contrast ratio? Here it becomes very hard for the consumer, because the specifications claimed by TV manufacturers are total BS (Barbara Streisand).
[Sidebar: Samsung's claims for their mega-expensive UHD LED/LCD TVs are absurd in the extreme, and that TV is way overpriced. Don't buy it, or any curved LED/LCD TV, because of horrible off-axis performance and light bleed-through. Curved OLED is fine. If you want a great TV, get one of LG's UHD OLED TVs. They are the best TVs in the world by far, and LG's patents make that unlikely to change for a while (unless others buy the panels from LG and rebrand them).]
Getting back to my point...
Here is how we see. Think about seeing a ship on the ocean, way out on the horizon. What you perceive first is contrast. You might only barely be able to make it out, and you see it in black-and-white. Then you see color, and then you see resolution. All are important, and all are aspects of a display's ability to resolve a picture (motion is another). Masterpiece paintings aren't necessarily great because of their detail; often they're not. UHD provides four times the resolution of HD, but it also provides a much wider color gamut (more colors) and more contrast.

Audio works in a similar way. Realistic music reproduction is a function of three fundamentals: dynamics, soundstage, and timbre (harmonics). Dynamic range is the difference between soft and loud. It's like contrast ratio in video. If there is noise, it masks the softer sounds (no matter how much resolution there is), just as a TV with poor contrast masks video resolution. One problem with LED/LCD TVs that doesn't occur on plasma or OLED is visible motion artifacts - the most obnoxious being a football moving choppily through a long pass. So, what do manufacturers do to compensate? They up the frame rate. That's analogous to upping the sampling rate in digital audio.

I could go on, but this is long enough. First, focus on getting rid of the noise. Then, work on getting the rest of the pieces in place (source, playback system, environment), and if you do it right - THEN you'll hear the difference. 100%! But only then.
All the best
Joe
The question then becomes - does the D20 offer something special in terms of audio quality to make it a good deal for the price? I'm not entirely convinced that it does just on the basis of one review.
And it would annoy me to own a DAC with a USB input that was so limited compared with what the DAC itself is capable of. If a DAC has a USB input, I want to be able to use it without having to worry about its particular limitations.
BTW, I'm fairly skeptical of a lot of the audiophile nonsense that goes around, and I'm not convinced there's any benefit to higher than 16-bit/44.1kHz when it comes to playback. But given issues like digital volume control, software EQ, and real-time mixing of multiple sources that tend to come up with computer-based playback, I'd prefer not to have everything resampled back down to 16-bit just because of a USB input limitation.