Learning Based Image and Video Editing
Image and video editing encompasses a wide range of operations that give desired visual effects to an image or video, either to improve visual properties such as color, contrast, and luminance, or to better emphasize aspects of a scene such as objects, background, activity, attributes, or emotion. Popular graphical tools (e.g. Adobe Photoshop, GIMP) provide rich image operations that can achieve the desired visual effects; however, users need to be familiar with image processing methods and have the skills to carry out challenging low-level operations on images and videos. Easy and efficient editing methods are therefore needed so that casual users can manipulate visual content through high-level interactions such as natural language. At the same time, the edits are expected to be imperceptible; in other words, photorealism should not be degraded. Recently, data-driven, learning-based approaches that try to meet these expectations have been proposed for various image and video editing problems. In this thesis, following recent trends and developments, we propose learning-based methods for three image and video editing problems: alpha matting, visual attribute manipulation, and language-based video manipulation. Our methods produce competitive or better results than state-of-the-art methods on benchmark datasets, both quantitatively and qualitatively, while providing simple high-level interactions such as natural language and visual attributes. In addition, our visual attribute manipulation method is the first high-level photo editing approach in the literature to enable continuous control over transient attributes of natural landscapes.

For alpha matting, we present a new sampling-based approach for the accurate estimation of the foreground and background layers of an image. Previous sampling-based methods typically rely on certain heuristics to collect representative samples from known regions, and thus their performance deteriorates if the underlying assumptions are not satisfied. To alleviate this, we take an entirely new approach and formulate sampling as a sparse subset selection problem, in which we pick a small set of candidate samples that best explains the unknown pixels. Moreover, we describe a new dissimilarity measure for comparing two samples, based on the KL-divergence between the distributions of features extracted in the vicinity of the samples. The proposed framework is general and can easily be extended to video matting by additionally taking temporal information into account in the sampling process. Evaluation on standard benchmark datasets for image and video matting demonstrates that our approach provides competitive results compared with state-of-the-art methods.

For visual attribute manipulation, we explore a two-stage framework that enables users to directly manipulate high-level attributes of a natural scene. The key to our approach is a deep generative network that can hallucinate images of a scene as if they were taken in a different season (e.g. during winter), weather condition (e.g. on a cloudy day), or time of day (e.g. at sunset). Once the scene is hallucinated with the given attributes, the corresponding look is transferred to the input image while keeping the semantic details intact, giving a photorealistic manipulation result. Because the proposed framework hallucinates what the scene will look like, it does not require a reference style image, unlike most appearance or style transfer approaches. Moreover, it allows a given scene to be manipulated simultaneously according to a diverse set of transient attributes within a single model, eliminating the need to train a separate network for each translation task. Our comprehensive set of qualitative and quantitative results demonstrates the effectiveness of our approach against competing methods.

In our last work, we introduce a new task of manipulating person videos with natural language, which aims to perform local and semantic edits on a video clip of an individual to automatically change their outfit based on a description of the target look. To this end, we first collect a new video dataset containing full-body images of different persons wearing different types of clothes, together with textual descriptions. The nature of the problem allows for better utilization of multi-view information, and we exploit this property to design a new language-guided video editing model. Our architecture is composed of two subnetworks trained simultaneously: a representation network, which constructs a concise representation of the person from multiple observations, and a translation network, which uses the extracted internal representation to perform the manipulation according to the target description. Our qualitative and quantitative evaluations demonstrate that the proposed approach significantly outperforms existing frame-wise methods, producing temporally coherent and semantically more meaningful results.
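
As an illustration of the KL-divergence-based dissimilarity mentioned for alpha matting, the following minimal sketch compares two samples by fitting a Gaussian to the feature vectors collected around each one and summing the two directed divergences. The Gaussian model, the symmetrization, the regularization constant, and the names `fit_gaussian`, `kl_gaussian`, and `sample_dissimilarity` are assumptions made for this example; the abstract does not specify how the feature distributions are modeled.

```python
import numpy as np


def fit_gaussian(features, eps=1e-6):
    """Fit a mean and a regularized covariance to an (n, d) array of features.

    Assumption: the local feature distribution is modeled as a single Gaussian.
    """
    features = np.asarray(features, dtype=float)
    mean = features.mean(axis=0)
    cov = np.atleast_2d(np.cov(features, rowvar=False))
    # A small diagonal term keeps the covariance invertible.
    cov = cov + eps * np.eye(features.shape[1])
    return mean, cov


def kl_gaussian(mean0, cov0, mean1, cov1):
    """Closed-form KL(N0 || N1) between two multivariate Gaussians."""
    d = mean0.shape[0]
    diff = mean1 - mean0
    trace_term = np.trace(np.linalg.solve(cov1, cov0))
    maha_term = diff @ np.linalg.solve(cov1, diff)
    _, logdet0 = np.linalg.slogdet(cov0)
    _, logdet1 = np.linalg.slogdet(cov1)
    return 0.5 * (trace_term + maha_term - d + logdet1 - logdet0)


def sample_dissimilarity(feats_a, feats_b):
    """Symmetrized KL divergence between the feature sets around two samples."""
    gauss_a = fit_gaussian(feats_a)
    gauss_b = fit_gaussian(feats_b)
    return kl_gaussian(*gauss_a, *gauss_b) + kl_gaussian(*gauss_b, *gauss_a)


# Hypothetical usage: features from two pixel neighborhoods (synthetic data).
rng = np.random.default_rng(0)
feats_a = rng.normal(0.0, 1.0, size=(200, 3))
feats_b = rng.normal(2.0, 1.0, size=(200, 3))
print(sample_dissimilarity(feats_a, feats_b))
```

Under these assumptions, two neighborhoods with similar feature statistics receive a dissimilarity close to zero, while neighborhoods with clearly different statistics receive a larger value.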