"The antiquity and ubiquity of pictures suggests that some form of 'visual literacy'--the ability to 'read' and understand pictures (e.g. Messaris, 1994)--is deeply embedded in the human mind, even the genome." (Cutting, 1998)
I, personally, have a strong disposition towards imagining that visual/structural interfaces are intuitive. But this, I believe, is only true on the individual level. Such interfaces are likely less intuitive or fundamental than I am inclined to think. In much the same vein, I would guess that projects like those in Georgia Tech's Sonification Lab exist because the people running them experience the auditory world more keenly than other worlds.
Truth be told, though, I have no evidence for this. But it still leads us to an interesting question--and a worthwhile one, if only to figure out whether my guesses above are correct. That is, which input modalities really are natural for people? Is that even a valid question? I imagine an ecological psychologist would say that behavior, perception, and cognition are inextricably tied, through all input modalities, to the environment.