Well, what do you consider to be "up" and "down"? Pulling the button in front of or behind you seems awfully similar to skipping forward or backward, in that you also pull away from or toward yourself. If the buttons are on the side of the head, does "up" mean up when the user is wearing them (which would become left and right when the button is facing the user)? Or does it mean forward and backward when on the head, but up and down when the button is facing the user? I find that the end-user experience often isn't a consideration in the design of user inputs, and sometimes it feels as if a product wasn't even tested before release. I think an up-and-down pull (left/right when viewed facing the button) for volume, and a forward/backward pull (up/down when viewed facing the button) for skipping tracks, would be most intuitive.
Adding to that question: why the longer button press to turn it off? It seems to me that I waste so much of my life pressing and holding buttons, when a single 1-2 second hold should be long enough to signal a serious request to power on or off.