The official notification from YouTube:
We use a combination of video characteristics such as color, spatial layout and motion to estimate a depth map for each frame of a monoscopic video sequence. We use machine learning from the growing number of true 3D videos on YouTube to learn video depth characteristics and apply them in depth estimation. The generated depth map and the original monoscopic frame create a stereo 3D left-right pair that a stereo display system needs to display a video as 3D.
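The announcement does not say how the depth map and the original frame are combined into a left-right pair. One common approach is depth-image-based rendering, where each pixel is shifted horizontally by an amount that grows with its nearness. The sketch below is a toy illustration of that idea under stated assumptions, not YouTube's actual method: the function name `synthesize_right_view`, the `max_disparity` parameter, and the convention that a depth value of 1.0 means "nearest" are all hypothetical.

```python
def synthesize_right_view(frame, depth, max_disparity=2):
    """Warp one frame into a second view using a per-pixel depth map.

    frame: list of rows of pixel values (the original, kept as the left view).
    depth: list of rows of floats in [0, 1]; 1.0 is ASSUMED to mean nearest.
    max_disparity: hypothetical cap, in pixels, on how far a pixel may shift.
    """
    right = []
    for row, depth_row in zip(frame, depth):
        width = len(row)
        out = [None] * width        # None marks a hole (no source pixel landed here)
        out_depth = [-1.0] * width  # nearness of the pixel currently in each slot

        for x, (pixel, d) in enumerate(zip(row, depth_row)):
            shift = int(round(d * max_disparity))  # nearer pixels move further
            tx = x - shift                         # right view: shift to the left
            if 0 <= tx < width and d > out_depth[tx]:
                out[tx] = pixel                    # nearer pixel wins a collision
                out_depth[tx] = d

        # Naive hole filling: take the neighbour on the right (usually the
        # uncovered background), then fill any trailing holes from the left.
        for x in range(width - 2, -1, -1):
            if out[x] is None:
                out[x] = out[x + 1]
        for x in range(1, width):
            if out[x] is None:
                out[x] = out[x - 1]

        right.append(out)
    return right


# One scanline with a single "near" pixel (value 30) in the middle.
frame = [[10, 20, 30, 40, 50]]
depth = [[0.0, 0.0, 1.0, 0.0, 0.0]]
left_view = frame
right_view = synthesize_right_view(frame, depth, max_disparity=1)
print(right_view)  # [[10, 30, 40, 40, 50]]
```

In the output, the near pixel has moved one position to the left and the gap it left behind is filled from the background beside it. A real system would need far better hole filling and sub-pixel shifts than this sketch attempts.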
For more information on the new 3D technology being implemented by Google, jump over to the YouTube blog post.