Whenever we present our work on evaluating users’ quality of experience (QoE) with online streamed video (like watching YouTube), people ask us, “So, how would this work in a real-world system?”
We’ve come up with some neat algorithms for determining how a user would evaluate video quality, based solely on measurements we can easily obtain like bandwidth and received packets, but up to this point our work has largely remained in the realm of proofs-of-concept and measurement-and-analysis tools. As a former engineer, though, I’m interested in building things. My vision for this project has always been to build a functioning system, one that can predict and evaluate video QoE in real time. So a couple of years ago, we set out to answer some of those questions.
This week, I’ll be presenting the first results of that study, “Systems Considerations in Real Time Video QoE Assessment”, at Globecom—specifically, at the Workshop on Quality of Experience for Multimedia Communications. In this study, we attempted to answer the following questions: How frequently can we generate video QoE ratings with some degree of accuracy? How often should we sample the measurement data? How do we balance the need to consolidate data collection (which argues for fewer, less frequent data points) against the need to monitor video quality in real time (which argues for more frequent data points)? What are the timing requirements for such a system, both in training the system and in assigning ratings to videos?
To answer these questions, we used the data we collected in the summer of 2010 for this paper, developed a mechanism to play back the data in pseudo-real time, and then sliced and diced the data in various ways. We played with the sampling rate: how many seconds should pass between measurements? We played with the amount of data to process at once: Would ten seconds be enough to give us an accurate video QoE rating? Would a minute be too long? Which video measurements should we use: all of them, some of them, one of them at a time? While trying all of these combinations, we kept our eye on the clock, literally: if this system is going to deliver results in real time, then we need to make sure that “teaching” the system how to evaluate videos does not take too long—otherwise, our system is not very adaptable, and thus not very useful.
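To make the playback idea concrete, here’s a minimal sketch (in Python, with made-up names—this is not our actual tooling) of what pseudo-real-time replay with a configurable sampling rate and evaluation window might look like:

```python
from collections import deque

def replay(measurements, sample_interval_s, window_s, rate_fn):
    """Replay pre-recorded per-second measurements in pseudo-real time.

    measurements: list of (timestamp_s, value) pairs, one per second.
    sample_interval_s: keep one sample every this many seconds.
    window_s: evaluate a QoE rating over windows of this many seconds.
    rate_fn: a trained model mapping a window of samples to a rating.
    """
    window = deque()
    ratings = []
    for t, value in measurements:
        if t % sample_interval_s != 0:
            continue  # skip measurements between the chosen sampling points
        window.append((t, value))
        # Drop samples that have fallen out of the evaluation window.
        while window and window[0][0] <= t - window_s:
            window.popleft()
        # Emit one rating per full evaluation window.
        if t % window_s == 0 and window:
            ratings.append((t, rate_fn([v for _, v in window])))
    return ratings
```

The two knobs here, `sample_interval_s` and `window_s`, correspond to the sampling rate and the amount of data processed at once—the parameters we varied in the study.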
In the middle of the study, we realized that our assumptions about the video delivery system itself might also impact how the system is designed. For instance, our system could look like Netflix: Netflix controls which videos are available to viewers, and the content is fairly stable (new videos are added on predictable schedules). This looks very different from something like YouTube, where the available content changes rapidly because people are constantly uploading videos. In the former case, we can “teach” our system using videos that people will be watching. In the latter case, there are no guarantees that we have videos to teach our system that look anything like what people will be watching. So we considered both of these scenarios as well.
So what did we learn from this study?
- Taking data samples frequently and processing smaller amounts yields the best results, most of the time. Except for the shortest video in our study (a 2-minute clip of dialogue from a movie), we were best able to predict video quality by sampling data every second and evaluating the data in 20-second chunks. (For the shortest video, going 50-60 seconds between evaluations worked better. We think this is because the scene changes in this clip happened about every 40-50 seconds.)
- More is not better, when it comes to what type of data to use. While we had 4 different categories of data available from the videos—bandwidth, frame rate, received packets, and the number of times the clip buffered—we found that concentrating on just bandwidth and received packets gave us the most accurate picture (no pun intended) of video quality, in general. Again, our shortest video was an outlier: here, frame rate did a better job of judging video quality.
- The system can operate in real time: assessing a video’s rating takes less than a second, and training the system takes at most 10 minutes, which happens off-line anyway. That “less than 10 minutes to train” figure is key, because it means we can continually re-train our system as new videos come online, if we choose to do so.
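To illustrate the train-offline / rate-online split and the “bandwidth plus received packets” finding, here’s a toy sketch. The field names are illustrative, and the nearest-centroid “model” is just a stand-in for whatever a deployed system would actually use—it is not the algorithm from the paper:

```python
def window_features(samples):
    """Reduce one evaluation window of per-second samples to the two
    measurements that worked best in general: mean bandwidth and mean
    received packets. (Field names here are illustrative.)"""
    n = len(samples)
    return (sum(s["bandwidth"] for s in samples) / n,
            sum(s["received_packets"] for s in samples) / n)

def train(labeled_windows):
    """Off-line training: average the features seen for each QoE rating,
    yielding one centroid per rating level."""
    sums, counts = {}, {}
    for samples, rating in labeled_windows:
        bw, pkts = window_features(samples)
        s = sums.setdefault(rating, [0.0, 0.0])
        s[0] += bw
        s[1] += pkts
        counts[rating] = counts.get(rating, 0) + 1
    return {r: (s[0] / counts[r], s[1] / counts[r]) for r, s in sums.items()}

def rate(model, samples):
    """Real-time path: a nearest-centroid lookup, comfortably sub-second."""
    bw, pkts = window_features(samples)
    return min(model,
               key=lambda r: (model[r][0] - bw) ** 2 + (model[r][1] - pkts) ** 2)
```

Because training is cheap and separate from rating, re-running `train` periodically as new videos appear (the YouTube-like scenario above) doesn’t interfere with the real-time rating path.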
Clearly this paper doesn’t definitively answer the question of how such a system would work, but it’s a step on the right path. We are actively considering some related questions, specifically what other infrastructure pieces would be required to support collecting, analyzing, processing, and feeding back such measurements into the system, so that the system could fix itself when video quality goes south. There’s also the question of how dependent our results are on the particular videos we selected. We actually found support for “one configuration to rule them all”: a sampling rate, evaluation time, and set of measurements that worked well across all videos and both of the video system scenarios. That’s promising in terms of the generalizability of our solution, but further study with additional videos would definitely help.
Acknowledgements: Two of my research students, Tung Phan ’13 and Robert Guo ’13, did the initial studies and analysis of the data, in 2011. At that point, we actually got stuck and put the project aside for a bit. The insights we gained from what didn’t work informed the approach in this paper, and definitely made this paper possible! The infrastructure for collecting and analyzing the data, and the data used in the paper, came out of work in the summer of 2010 by Guo, Anya Johnson ’12, Andy Bouchard ’12, and Sara Cantor ’11.