WorldVu

Our project, “WorldVu”, is a 360-degree frame processing accelerator pipeline.

Team Members: Sampath Chanda, Harish Dattatraya Dixit.


For Project main page (all information including proposal, checkpoint, etc.,), refer to: Project Home

Summary

Our goal is to accelerate 360-degree frame processing, taking source video from the Facebook Surround 360 rig and verifying correctness against Facebook's reference implementation. We intend to use GPUs to accelerate the processing.

Background

Our project accelerates a 360-degree frame processing application. It takes a source set of frames from Facebook's Surround 360 rig of 14 cameras, each pointing in a different direction, and constructs a complete 360-degree image from the 14 camera views while preserving quality. The current state-of-the-art pipeline implemented by Facebook takes about 45 seconds per frame (per information from Alex), and we intend to accelerate it, using Facebook's current implementation of the algorithm as our starting point. We plan to achieve the speedup by exploiting parallelism across images, eliminating redundant work between overlapping images, and using the GPU's system architecture effectively to obtain the best possible performance.

The Challenge

The problem poses many challenges. The sheer computation involved in generating 360-degree frames requires processing video from 14 high-resolution cameras. The memories storing the images will be heavily stressed, since a single camera's image does not fit in cache, and cache thrashing may occur when images contend for the same cache lines depending on the stage of the computation. The time spent in each processing stage also needs careful tuning. With 14 overlapping cameras there is certain to be redundancy between images, but identifying those redundancies and tuning the GPU to avoid repeated computation of no value is hard, because workload characterisation cannot be done on a single sample frame. Designing an efficient pipeline that accounts for all of these architectural concerns makes this a challenging as well as rewarding project that draws on many aspects taught during the course.

Resources

We plan to use the GPUs in the Gates cluster for development, since we anticipate long runtimes and a large number of iterations. We will start from the Facebook code base at https://github.com/facebook/Surround360. We will also refer to the documentation and overall implementation of the Scanner-based pipeline from Alex Poms (https://github.com/scanner-research/Surround360), as well as Facebook's draft documentation describing the challenges involved. In addition, we will use the CUDA documentation as a coding reference and consult papers on OpenCV and vision-pipeline accelerators to help design the best possible accelerator for this problem.

Goals and Deliverables

We plan to achieve a significant performance improvement (at least 3X) over the current Facebook Surround 360 pipeline. At 45 seconds per frame, processing a 1-minute video currently takes around 1350 minutes. Reducing the per-frame time from 45 seconds to 15 seconds would cut the 1-minute video's processing time to 450 minutes; this is the achievable goal we are setting ourselves. Our stretch goal is to bring the per-frame time down from 45 seconds to 1 second, reducing the total processing time from 1350 minutes to 30 minutes.
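As a sanity check on the arithmetic above, a short script reproduces the quoted totals. The 30 fps frame rate is our assumption (the proposal does not state it), chosen because it is the rate at which the quoted figures are consistent:

```python
# Sanity check for the processing-time arithmetic above.
# Assumption (not stated in the proposal): the source video is 30 fps.
FPS = 30
VIDEO_SECONDS = 60               # a 1-minute video
FRAMES = FPS * VIDEO_SECONDS     # 1800 frames

def total_minutes(seconds_per_frame):
    """Total processing time, in minutes, at a given per-frame cost."""
    return FRAMES * seconds_per_frame / 60

print(total_minutes(45))  # current pipeline: 1350.0 minutes
print(total_minutes(15))  # achievable goal:   450.0 minutes
print(total_minutes(1))   # stretch goal:       30.0 minutes
```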

Demo: In the demo, we will compare our implementation with Facebook's and perform a run-time comparison of the two frame processing pipelines. Specifically, we will feed the same 360-degree frame sequence to both pipelines and time the output generation. In addition, we will generate graphs showing the scalability of our implementation and the performance bottlenecks we identified.
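The run-time comparison could use a small harness along these lines. The pipeline entry points in the usage sketch (`reference_pipeline`, `worldvu_pipeline`) are hypothetical names standing in for the Facebook binary and our implementation:

```python
import time

def time_pipeline(process_frames, frames):
    """Run a frame-processing callable over a frame sequence and
    return the elapsed wall-clock time in seconds."""
    start = time.perf_counter()
    process_frames(frames)
    return time.perf_counter() - start

def speedup(baseline_seconds, ours_seconds):
    """Speedup of our pipeline relative to the baseline."""
    return baseline_seconds / ours_seconds

# Hypothetical usage (names are illustrative, not real entry points):
#   baseline = time_pipeline(reference_pipeline, frame_sequence)
#   ours = time_pipeline(worldvu_pipeline, frame_sequence)
#   print(f"speedup: {speedup(baseline, ours):.1f}x")
```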

Platform Choice

We will implement our acceleration algorithms on GPUs. Since the input consists of many frames and GPUs excel at exploiting data parallelism, the GPU is the natural platform for our implementation. To make a fair comparison, we will also establish an improved reference baseline by replicating a naive kernel that translates the Facebook source code into CUDA, and then improve on that baseline with our own algorithms and optimizations.
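As an illustration of the data parallelism we expect to exploit, the CPU-side sketch below maps the same independent per-frame operation across a pool of workers; on a GPU the same structure maps to one thread block per frame (or one thread per pixel). All names and the per-pixel operation are placeholders, not part of the actual pipeline:

```python
from concurrent.futures import ThreadPoolExecutor

def process_frame(frame):
    """Stand-in per-frame kernel: each frame is processed independently
    of the others, which is exactly the structure a GPU exploits."""
    return [pixel * 2 for pixel in frame]  # placeholder per-pixel op

def process_all(frames, workers=4):
    # Independent frames can be dispatched in parallel with no
    # synchronization, mirroring a grid of CUDA thread blocks.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(process_frame, frames))
```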

Schedule

  1. Get the Surround360 pipeline to compile (we will use this repo to set up the dependencies: https://github.com/holgr/build_surround360).
  2. Get the pipeline running over the 2-frame sample dataset that is available.
  3. Do a deep dive into the performance of the pipeline, measuring timing and identifying performance bottlenecks.
  4. Perform small experiments that might improve performance, to verify our hypotheses about the bottlenecks.
  5. Fully implement the intended optimizations (for example, making better use of cache locality or using GPUs).