A major problem in crowd counting is the varying size of human heads within an image, caused by changes in camera viewpoint, perspective, and imaging resolution. To address this problem, we propose a scale-adaptive CNN-based framework (SA-CNN) that implicitly outputs features at multiple scales and adaptively estimates density maps from the fusion of these features. Because head size is not annotated in all public crowd datasets, we design a dynamic supervision strategy that achieves this scale adaptation without relying on head-size annotations. Experimental results show that our method achieves state-of-the-art performance while demonstrating significant robustness to hyper-parameters and even to the choice of backbone.