{"title": "Deep Mean-Shift Priors for Image Restoration", "book": "Advances in Neural Information Processing Systems", "page_first": 763, "page_last": 772, "abstract": "In this paper we introduce a natural image prior that directly represents a Gaussian-smoothed version of the natural image distribution. We include our prior in a formulation of image restoration as a Bayes estimator that also allows us to solve noise-blind image restoration problems. We show that the gradient of our prior corresponds to the mean-shift vector on the natural image distribution. In addition, we learn the mean-shift vector field using denoising autoencoders, and use it in a gradient descent approach to perform Bayes risk minimization. We demonstrate competitive results for noise-blind deblurring, super-resolution, and demosaicing.", "full_text": "Deep Mean-Shift Priors for Image Restoration\n\nSiavash A. Bigdeli\nUniversity of Bern\n\nbigdeli@inf.unibe.ch\n\nMeiguang Jin\n\nUniversity of Bern\njin@inf.unibe.ch\n\nPaolo Favaro\n\nUniversity of Bern\n\nfavaro@inf.unibe.ch\n\nUniversity of Bern, and University of Maryland, College Park\n\nMatthias Zwicker\n\nzwicker@cs.umd.edu\n\nAbstract\n\nIn this paper we introduce a natural image prior that directly represents a Gaussian-\nsmoothed version of the natural image distribution. We include our prior in a\nformulation of image restoration as a Bayes estimator that also allows us to solve\nnoise-blind image restoration problems. We show that the gradient of our prior\ncorresponds to the mean-shift vector on the natural image distribution. In addition,\nwe learn the mean-shift vector \ufb01eld using denoising autoencoders, and use it in a\ngradient descent approach to perform Bayes risk minimization. 
We demonstrate competitive results for noise-blind deblurring, super-resolution, and demosaicing.\n\n1 Introduction\n\nImage restoration tasks, such as deblurring and denoising, are ill-posed problems, whose solution requires effective image priors. In the last decades, several natural image priors have been proposed, including total variation [29], gradient sparsity priors [12], models based on image patches [5], and Gaussian mixtures of local filters [25], just to name a few of the most successful ideas. See Figure 1 for a visual comparison of some popular priors. More recently, deep learning techniques have been used to construct generic image priors.\nHere, we propose an image prior that is directly based on an estimate of the natural image probability distribution. Although this seems like the most intuitive and straightforward idea to formulate a prior, only a few previous techniques have taken this route [20]. Instead, most priors are built on intuition or statistics of natural images (e.g., sparse gradients). Most previous deep learning priors are derived in the context of specific algorithms to solve the restoration problem, but it is not clear how these priors relate to the probability distribution of natural images. In contrast, our prior directly represents the natural image distribution smoothed with a Gaussian kernel, an approximation similar to using a Gaussian kernel density estimate. Note that we cannot hope to use the true image probability distribution itself as our prior, since we only have a finite set of samples from this distribution. We show a visual comparison in Figure 1, where our prior is able to capture the structure of the underlying image, but others tend to simplify the texture to straight lines and sharp edges.\nWe formulate image restoration as a Bayes estimator, and define a utility function that includes the smoothed natural image distribution. 
We approximate the estimator with a bound, and show that\nthe gradient of the bound includes the gradient of the logarithm of our prior, that is, the Gaussian\nsmoothed density. In addition, the gradient of the logarithm of the smoothed density is proportional\nto the mean-shift vector [8], and it has recently been shown that denoising autoencoders (DAEs) learn\nsuch a mean-shift vector \ufb01eld for a given set of data samples [1, 4]. Hence we call our prior a deep\nmean-shift prior, and our framework is an example of Bayesian inference using deep learning.\n\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\n\n\fInput\n\nOur prior\n\nBM3D [9]\n\nEPLL [41]\n\nFoE [28]\n\nSF [31]\n\nFigure 1: Visualization of image priors using the method by Shaham et al. [32]: Our deep mean-shift\nprior learns complex structures with different curvatures. Other priors prefer simpler structures like\nlines with small curvature or sharp corners.\n\nWe demonstrate image restoration using our prior for noise-blind deblurring, super-resolution, and\nimage demosaicing, where we solve Bayes estimation using a gradient descent approach. We achieve\nperformance that is competitive with the state of the art for these applications. In summary, the main\ncontributions of this paper are:\n\n\u2022 A formulation of image restoration as a Bayes estimator that leverages the Gaussian\nsmoothed density of natural images as its prior. In addition, the formulation allows us\nto solve noise-blind restoration problems.\n\u2022 An implementation of the prior, which we call deep mean-shift prior, that builds on denoising\nautoencoders (DAEs). We rely on the observation that DAEs learn a mean-shift vector \ufb01eld,\nwhich is proportional to the gradient of the logarithm of the prior.\n\u2022 Image restoration techniques based on gradient-descent risk minimization with competitive\n\nresults for noise-blind image deblurring, super-resolution, and demosaicing. 
1\n\n2 Related Work\n\nImage Priors. A comprehensive review of previous image priors is outside the scope of this paper.\nInstead, we refer to the overview by Shaham et al. [32], where they propose a visualization technique\nto compare priors. Our approach is most related to techniques that leverage CNNs to learn image\npriors. These techniques build on the observation by Venkatakrishnan et al. [33] that many algorithms\nthat solve image restoration via MAP estimation only need the proximal operator of the regularization\nterm, which can be interpreted as a MAP denoiser [22]. Venkatakrishnan et al. [33] build on the\nADMM algorithm and propose to replace the proximal operator of the regularizer with a denoiser\nsuch as BM3D [9] or NLM [5]. Unsurprisingly, this inspired several researchers to learn the proximal\noperator using CNNs [6, 40, 35, 22]. Meinhardt et al. [22] consider various proximal algorithms\nincluding the proximal gradient method, ADMM, and the primal-dual hybrid gradient method, where\nin each case the proximal operator for the regularizer can be replaced by a neural network. They\nshow that no single method will produce systematically better results than the others.\nIn the proximal techniques the relation between the proximal operator of the regularizer and the\nnatural image probability distribution remains unclear. In contrast, we explicitly use the Gaussian-\nsmoothed natural image distribution as a prior, and we show that we can learn the gradient of its\nlogarithm using a denoising autoencoder.\nRomano et al. [27] designed a prior model that is also implemented by a denoiser, but that does not\nbuild on a proximal formulation such as ADMM. Interestingly, the gradient of their regularization\nterm boils down to the residual of the denoiser, that is, the difference between its input and output,\nwhich is the same as in our approach. 
However, their framework does not establish the connection between the prior and the natural image probability distribution, as we do. Finally, Bigdeli and Zwicker [4] formulate an energy function, where they use a Denoising Autoencoder (DAE) network for the prior, as in our approach, but they do not address the case of noise-blind restoration.\n\nNoise- and Kernel-Blind Deconvolution. Kernel-blind deconvolution has seen the most effort recently, while we support the fully (noise and kernel) blind setting. Noise-blind deblurring is usually performed by first estimating the noise level and then restoring with the estimated noise. Jin et al. [14] proposed a Bayes risk formulation that can perform deblurring by adaptively changing the regularization without needing an estimate of the noise variance. Zhang et al. [37, 38] explored a spatially-adaptive sparse prior and scale-space formulation to handle noise- or kernel-blind deconvolution. These methods, however, are tailored specifically to image deconvolution. Also, they only handle the noise- or kernel-blind case, but not fully blind.\n\n1The source code of the proposed method is available at https://github.com/siavashbigdeli/DMSP.\n\n3 Bayesian Formulation\n\nWe assume a standard model for image degradation,\n\ny = k ∗ ξ + n, n ∼ N(0, σn²), (1)\n\nwhere ξ is the unknown image, k is the blur kernel, n is zero-mean Gaussian noise with variance σn², and y is the observed degraded image. We restore an estimate x of the unknown image by defining and maximizing an objective consisting of a data term and an image likelihood,\n\nargmax_x Φ(x) = data(x) + prior(x). (2)\n\nOur core contribution is to construct a prior that corresponds to the logarithm of the Gaussian-smoothed probability distribution of natural images. 
We will optimize the objective using gradient descent, and leverage the fact that we can learn the gradient of the prior using a denoising autoencoder (DAE). We next describe how we define our objective by formulating a Bayes estimator in Section 3.1, then explain how we leverage DAEs to obtain the gradient of our prior in Section 3.2, describe our gradient descent approach in Section 3.3, and finally our image restoration applications in Section 4.\n\n3.1 Defining the Objective via a Bayes Estimator\n\nA typical approach to solve the restoration problem is via a maximum a posteriori (MAP) estimate, where one considers the posterior distribution of the restored image p(x|y) ∝ p(y|x)p(x), derives an objective consisting of a sum of data and prior terms by taking the logarithm of the posterior, and maximizes it (minimizes the negative log-posterior, respectively). Instead, we will compute a Bayes estimator x for the restoration problem by maximizing the posterior expectation of a utility function,\n\nE_x̃[G(x̃, x)] = ∫ G(x̃, x) p(y|x̃) p(x̃) dx̃, (3)\n\nwhere G denotes the utility function (e.g., a Gaussian), which encourages its two arguments to be similar. This is a generalization of MAP, where the utility is a Dirac impulse.\nIdeally, we would like to use the true data distribution as the prior p(x̃). But we only have data samples, hence we cannot learn this exactly. Therefore, we introduce a smoothed data distribution\n\np′(x) = E_η[p(x + η)] = ∫ g_σ(η) p(x + η) dη, (4)\n\nwhere η has a Gaussian distribution with zero mean and variance σ², which is represented by the smoothing kernel g_σ. The key idea here is that it is possible to estimate the smoothed distribution p′(x) or its gradient from sample data. In particular, we will need the gradient of its logarithm, which we will learn using denoising autoencoders (DAEs). 
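The mean-shift connection behind this choice can be checked numerically on a toy example: for a finite sample, σ²∇ log of a Gaussian kernel density estimate is exactly the classical mean-shift vector (weighted sample mean minus the query point). A minimal 1-D numpy sketch, with toy data and names of our own rather than anything from the paper:

```python
import numpy as np

# For a Gaussian KDE p'(x) built from samples, the scaled log-density
# gradient sigma^2 * d/dx log p'(x) equals the mean-shift vector.

def gaussian(d, sigma):
    # Unnormalized Gaussian kernel; the normalization cancels in all ratios.
    return np.exp(-0.5 * (d / sigma) ** 2)

def mean_shift(x, samples, sigma):
    # Weighted sample mean minus x, with Gaussian weights centered at x.
    w = gaussian(x - samples, sigma)
    return np.sum(w * samples) / np.sum(w) - x

def log_kde(x, samples, sigma):
    # Logarithm of the Gaussian-smoothed (kernel density) estimate.
    return np.log(np.mean(gaussian(x - samples, sigma)))

samples = np.array([-2.1, -1.9, 0.3, 2.0, 2.2, 2.4])  # toy "data"
sigma, x, eps = 0.8, 0.5, 1e-5

# Central-difference gradient of log p'(x), scaled by sigma^2.
grad = sigma ** 2 * (log_kde(x + eps, samples, sigma)
                     - log_kde(x - eps, samples, sigma)) / (2 * eps)
shift = mean_shift(x, samples, sigma)
```

The same identity is what allows a DAE trained on noisy samples to represent the gradient of log p′ without ever forming p′ explicitly.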
We now define our utility function as\n\nG(x̃, x) = g_σ(x̃ − x) p′(x) / p(x̃), (5)\n\nwhere we use the same Gaussian function g_σ with standard deviation σ as introduced for the smoothed distribution p′. This penalizes the estimate x if the latent parameter x̃ is far from it. In addition, the term p′(x)/p(x̃) penalizes the estimate if its smoothed density is lower than the true density of the latent parameter. Unlike the utility in Jin et al. [14], this approach will allow us to express the prior directly using the smoothed distribution p′.\nBy inserting our utility function into the posterior expected utility in Equation (3) we obtain\n\nE_x̃[G(x̃, x)] = ∫ g_σ(ε) p(y|x + ε) ∫ g_σ(η) p(x + η) dη dε, (6)\n\nwhere the true density p(x̃) canceled out, as desired, and we introduced the substitution ε = x̃ − x.\nWe finally formulate our objective by taking the logarithm of the expected utility in Equation (6), and introducing a lower bound that will allow us to split Equation (6) into a data term and an image likelihood. By exploiting the concavity of the log function, we apply Jensen's inequality and get our objective Φ(x) as\n\nlog E_x̃[G(x̃, x)] = log ∫ g_σ(ε) p(y|x + ε) ∫ g_σ(η) p(x + η) dη dε\n≥ ∫ g_σ(ε) log [ p(y|x + ε) ∫ g_σ(η) p(x + η) dη ] dε\n= ∫ g_σ(ε) log p(y|x + ε) dε + log ∫ g_σ(η) p(x + η) dη = Φ(x), (7)\n\nwhere the first integral is the data term data(x) and the second term is the image likelihood prior(x).\n\nImage Likelihood. 
We denote the image likelihood as\n\nprior(x) = log ∫ g_σ(η) p(x + η) dη. (8)\n\nThe key observation here is that our prior expresses the image likelihood as the logarithm of the Gaussian-smoothed true natural image distribution p(x), which is similar to a kernel density estimate.\n\nData Term. Given that the degradation noise is Gaussian, we see that [14]\n\ndata(x) = ∫ g_σ(ε) log p(y|x + ε) dε = −|y − k ∗ x|² / (2σn²) − (M σ² / (2σn²)) |k|² − N log σn + const, (9)\n\nwhere M and N denote the number of pixels in x and y respectively. This will allow us to address noise-blind problems as we will describe in detail in Section 4.\n\n3.2 Gradient of the Prior via Denoising Autoencoders (DAE)\n\nA key insight of our approach is that we can effectively learn the gradients of our prior in Equation (8) using denoising autoencoders (DAEs). A DAE r_σ is trained to minimize [34]\n\nL_DAE = E_{η,x} [ |x − r_σ(x + η)|² ], (10)\n\nwhere the expectation is over all images x and Gaussian noise η with variance σ², and r_σ indicates that the DAE was trained with noise variance σ². Note that this is the same loss as in non-parametric least squares estimators [23, 26, 20]. Similar to Alain and Bengio [1], we parametrize this estimator using neural networks for fast evaluation. They show that the output r_σ(x) of the optimal DAE (assuming unlimited capacity) is related to the true data distribution p(x) as\n\nr_σ(x) = x − E_η[p(x − η) η] / E_η[p(x − η)] = x − ∫ g_σ(η) p(x − η) η dη / ∫ g_σ(η) p(x − η) dη, (11)\n\nwhere the noise has a Gaussian distribution g_σ with standard deviation σ. 
This is simply a continuous formulation of mean-shift, and g_σ corresponds to the smoothing kernel in our prior, Equation (8).\nTo obtain the relation between the DAE and the desired gradient of our prior, we first rewrite the numerator in Equation (11) using the Gaussian derivative definition to remove η, that is,\n\n∫ g_σ(η) p(x − η) η dη = −σ² ∫ ∇g_σ(η) p(x − η) dη = −σ² ∇ ∫ g_σ(η) p(x − η) dη, (12)\n\nwhere we used the Leibniz rule to interchange the ∇ operator with the integral. Plugging this back into Equation (11), we have\n\nr_σ(x) = x + σ² ∇ ∫ g_σ(η) p(x − η) dη / ∫ g_σ(η) p(x − η) dη = x + σ² ∇ log ∫ g_σ(η) p(x − η) dη. (13)\n\nOne can now see that the DAE error, that is, the difference r_σ(x) − x between the output of the DAE and its input, is the gradient of the image likelihood in Equation (8). Hence, a main result of our approach is that we can write the gradient of our prior using the DAE error,\n\n∇ prior(x) = ∇ log ∫ g_σ(η) p(x + η) dη = (1/σ²) ( r_σ(x) − x ). (14)\n\nNB: 1. ut = (1/σn²) Kᵀ(K xt−1 − y) − ∇priorsL(xt−1)  2. ū = µū − αut  3. xt = xt−1 + ū\nNA: 1. ut = λt Kᵀ(K xt−1 − y) − ∇priorsL(xt−1)  2. ū = µū − αut  3. xt = xt−1 + ū\nKE: 4. vt = λt [ Xᵀ(Kt−1 xt−1 − y) + M σ² kt−1 ]  5. v̄ = µk v̄ − αk vt  6. kt = kt−1 + v̄\nTable 1: Gradient descent steps for non-blind (NB), noise-blind (NA), and kernel-blind (KE) image deblurring. Kernel-blind deblurring involves the steps for (NA) and (KE) to update image and kernel.\n\n3.3 Stochastic Gradient Descent\n\nWe consider the optimization as minimization of the negative of our objective Φ(x) and refer to it as gradient descent. Similar to Bigdeli and Zwicker [4], we observed that the trained DAE is overfitted to noisy images. Because of the large gap in dimensionality between the embedding space and the natural image manifold, the vast majority of training inputs (noisy images) for the DAE lie at a distance very close to σ from the natural image manifold. Hence, the DAE cannot effectively learn mean-shift vectors for locations that are closer than σ to the natural image manifold. In other words, our DAE does not produce meaningful results for input images that do not exhibit noise close to the DAE training σ.\nTo address this issue, we reformulate our prior to perform stochastic gradient descent steps that include noise sampling. We rewrite our prior from Equation (8) as\n\nprior(x) = log ∫ g_σ(η) p(x + η) dη (15)\n= log ∫ g_σ₂(η₂) ∫ g_σ₁(η₁) p(x + η₁ + η₂) dη₁ dη₂ (16)\n≥ ∫ g_σ₂(η₂) log [ ∫ g_σ₁(η₁) p(x + η₁ + η₂) dη₁ ] dη₂ = priorL(x), (17)\n\nwhere σ₁² + σ₂² = σ², we used the fact that two Gaussian convolutions are equivalent to a single convolution with a Gaussian whose variance is the sum of the two, and we applied Jensen's inequality again. This leads to a new lower bound for the prior, which we call priorL(x). Note that the bound proposed by Jin et al. 
[14] corresponds to the special case where σ₁ = 0 and σ₂ = σ.\nWe address our DAE overfitting issue by using the new lower bound priorL(x) with σ₁ = σ₂ = σ/√2. Its gradient is\n\n∇priorL(x) = (2/σ²) ∫ g_{σ/√2}(η₂) [ r_{σ/√2}(x + η₂) − (x + η₂) ] dη₂. (18)\n\nIn practice, computing the integral over η₂ is not possible at runtime. Instead, we approximate the integral with a single noise sample, which leads to the stochastic evaluation of the gradient as\n\n∇priorsL(x) = (2/σ²) ( r_{σ/√2}(x + η₂) − x ), (19)\n\nwhere η₂ ∼ N(0, σ₂²). This addresses the overfitting issue, since it means we add noise each time before we evaluate the DAE. Given the stochastically sampled gradient of the prior, we apply a gradient descent approach with momentum that consists of the following steps:\n\n1. ut = −∇data(xt−1) − ∇priorsL(xt−1)  2. ū = µū − αut  3. xt = xt−1 + ū (20)\n\nwhere ut is the update step for x at iteration t, ū is the running step, and µ and α are the momentum and step size.\n\n4 Image Restoration using the Deep Mean-Shift Prior\n\nWe next describe the detailed gradient descent steps, including the derivatives of the data term, for different image restoration tasks. We provide a summary in Table 1. 
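The momentum update in Equation (20) can be sketched on a toy objective. The quadratic data and prior terms below are stand-ins of our own (the paper's method uses the data term of Equation (9) and the stochastic DAE-based prior gradient); the loop itself follows the three steps verbatim:

```python
import numpy as np

# Toy stand-ins (assumptions, not the paper's learned DAE): scalar quadratic
# data and prior terms whose gradients are available in closed form.
def grad_data(x):   # gradient of data(x) = -(x - 2)^2
    return -2.0 * (x - 2.0)

def grad_prior(x):  # gradient of prior(x) = -0.5 * (x - 4)^2
    return -(x - 4.0)

x, u_bar = 0.0, 0.0
mu, alpha = 0.9, 0.1           # momentum and step size used in the paper
for _ in range(300):           # Eq. (20): 1) update step, 2) running step, 3) move
    u_t = -grad_data(x) - grad_prior(x)
    u_bar = mu * u_bar - alpha * u_t
    x = x + u_bar

# Stationary point of data(x) + prior(x): 2(x-2) + (x-4) = 0, i.e. x = 8/3.
```

With these toy terms the iterate converges to the maximizer x = 8/3, mirroring how the image estimate ascends the objective Φ(x).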
For brevity, we omit the role of downsampling (required for super-resolution) and masking.\n\nσn:\n\nMethod\nFD [18]\nEPLL [41]\nRTF-6 [30]*\nCSF [31]\nDAEP [4]\nIRCNN [40]\nEPLL [41] + NE\nEPLL [41] + NA\nTV-L2 + NA\nGradNet 7S [14]\nOurs\nOurs + NA\n\n2.55\n30.03\n32.03\n32.36\n29.85\n32.64\n30.86\n31.86\n32.16\n31.05\n31.43\n29.68\n32.57\n\nLevin [19]\n7.65\n5.10\n27.32\n28.40\n28.31\n29.79\n21.43\n26.34\n28.13\n27.28\n28.30\n30.07\n28.83\n29.85\n28.28\n29.77\n30.25\n28.96\n28.03\n29.14\n27.55\n28.88\n28.95\n29.45\n29.00\n30.21\n\n10.2\n26.52\n27.20\n17.33\n26.70\n27.15\n28.05\n27.16\n27.85\n27.16\n26.96\n28.29\n28.23\n\n2.55\n24.44\n25.38\n25.70\n24.73\n25.42\n25.60\n25.36\n25.57\n24.61\n25.57\n25.69\n26.00\n\nBerkeley [2]\n7.65\n5.10\n22.64\n23.24\n22.54\n23.53\n19.83\n23.45\n23.61\n22.88\n22.78\n23.67\n23.42\n24.24\n22.55\n23.53\n23.90\n22.91\n22.90\n23.65\n23.46\n24.23\n23.60\n24.45\n24.47\n23.61\n\n10.2\n22.07\n21.91\n16.94\n22.44\n22.21\n22.91\n21.90\n22.27\n22.34\n22.94\n22.99\n22.97\n\nTable 2: Average PSNR (dB) for non-blind deconvolution on two datasets (*trained for σn = 2.55).\n\nNon-Blind Deblurring (NB). The gradient descent steps for non-blind deblurring with a known kernel and degradation noise variance are given in Table 1, top row (NB). Here K denotes the Toeplitz matrix of the blur kernel k.\n\nNoise-Adaptive Deblurring (NA). 
When the degradation noise variance σn² is unknown, we can solve Equation (9) for the optimal σn² (since it is independent of the prior), which gives\n\nσn² = (1/N) [ |y − k ∗ x|² + M σ² |k|² ]. (21)\n\nBy plugging this back into the equation, we get the following data term\n\ndata(x) = −(N/2) log [ |y − k ∗ x|² + M σ² |k|² ], (22)\n\nwhich is independent of the degradation noise variance σn². We show the gradient descent steps in Table 1, second row (NA), where λt = N ( |y − K xt−1|² + M σ² |k|² )⁻¹ adaptively scales the data term with respect to the prior.\n\nNoise- and Kernel-Blind Deblurring (NA+KE). Gradient descent in noise-blind optimization includes an intuitive regularization for the kernel. We can use the objective in Equation (22) to jointly optimize for the unknown image and the unknown kernel. The gradient descent steps to update the image remain as in Table 1, second row (NA), and we take additional steps to update the kernel estimate, as in Table 1, third row (KE). Additionally, we project the kernel by applying kt = max(kt, 0) and kt = kt / |kt|₁ after each step.\n\n5 Experiments and Results\n\nOur DAE uses the neural network architecture by Zhang et al. [39]. We generated training samples by adding Gaussian noise to images from ImageNet [10]. We experimented with different noise levels and found σ₁ = 11 to perform well for all our deblurring and super-resolution experiments. Unless mentioned otherwise, for image restoration we always take 300 iterations with step length α = 0.1 and momentum µ = 0.9. 
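The noise-adaptive quantities above, the optimal σn² of Equation (21) and the scale λt, follow directly from the residual. A minimal 1-D sketch under our own toy setup (dense convolution in place of the Toeplitz matrix K):

```python
import numpy as np

# Toy 1-D sketch (our own construction): optimal noise variance (Eq. 21)
# and the adaptive data-term weight lambda_t derived from it.
def noise_adaptive_weight(y, k, x, sigma):
    residual = y - np.convolve(x, k, mode="same")   # y - k * x
    N, M = y.size, x.size                           # pixels in y and x
    s = np.sum(residual ** 2) + M * sigma ** 2 * np.sum(k ** 2)
    sigma_n2 = s / N   # optimal degradation-noise variance estimate
    lam = N / s        # lambda_t: adaptively scales the data term
    return sigma_n2, lam

rng = np.random.default_rng(1)
x = rng.normal(size=64)                  # "sharp" toy signal
k = np.array([0.25, 0.5, 0.25])          # toy blur kernel
n = rng.normal(scale=0.1, size=64)       # degradation noise
y = np.convolve(x, k, mode="same") + n   # observed signal
sigma_n2, lam = noise_adaptive_weight(y, k, x, sigma=0.0)
```

With the prior smoothing set to σ = 0 and x the true signal, the estimate reduces to the empirical noise variance of the residual, which is the behavior the adaptive scheme relies on.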
The runtime of our method is linear in the number of pixels, and our\nimplementation takes about 0.2 seconds per iteration for one megapixel on an Nvidia Titan X (Pascal).\n\n5.1\n\nImage Deblurring: Non-Blind and Noise-Blind\n\nIn this section we evaluate our method for image deblurring using two datasets. Table 2 reports\nthe average PSNR for 32 images from the Levin et al. [19] and 50 images from the Berkeley [2]\nsegmentation dataset, where 10 images are randomly selected and blurred with 5 kernels as in Jin et\nal. [14]. We highlight the best performing PSNR in bold and underline the second best value. The\n\n6\n\n\fGround Truth\n\nEPLL [41]\n\nDAEP [4] GradNet 7S [14]\n\nOurs\n\nOurs + NA\n\nFigure 2: Visual comparison of our deconvolution results.\n\nGround Truth\n\nBlurred with 1% noise\n\nOurs (blind)\n\nSSD Error Ratio\n\nFigure 3: Performance of our method for fully (noise- and kernel-) blind deblurring on Levin\u2019s set.\n\nupper half of the table includes non-blind methods for deblurring. EPLL [41] + NE uses a noise\nestimation step followed by non-blind deblurring. Noise-blind experiments are denoted by NA for\nnoise adaptivity. We include our results for non-blind (Ours) and noise-blind (Ours + NA). Our noise\nadaptive approach consistently performs well in all experiments and on average we achieve better\nresults than the state of the art. Figure 2 provides a visual comparison of our results. Our prior is able\nto produce sharp textures while also preserving the natural image structure.\n\n5.2\n\nImage Deblurring: Noise- and Kernel-Blind\n\nWe performed fully blind deconvolution with our method using Levin et al.\u2019s [19] dataset. In this test,\nwe performed 1000 gradient descent iterations. We used momentum \u00b5 = 0.7 and step size \u03b1 = 0.3\nfor the unknown image and momentum \u00b5k = 0.995 and step size \u03b1k = 0.005 for the unknown\nkernel. 
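The kernel projection applied after each kernel step (non-negativity followed by ℓ1 normalization, as described in Section 4) can be sketched in a few lines; the example values are made up:

```python
import numpy as np

# Projection applied to the kernel estimate after each gradient step:
# clip negative entries, then normalize to unit l1 mass.
def project_kernel(k):
    k = np.maximum(k, 0.0)
    return k / np.sum(np.abs(k))

k = project_kernel(np.array([0.5, -0.2, 1.5]))  # made-up example kernel
```

This keeps the kernel a valid (non-negative, unit-sum) blur throughout the joint optimization.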
Figure 3 shows visual results of fully blind deblurring and a performance comparison to the state of the art (last column). We compare the SSD error ratio and the number of images in the dataset that achieve error ratios less than a threshold. Results for other methods are as reported by Perrone and Favaro [24]. Our method can reconstruct all the blurry images in the dataset with error ratios less than 3.5. Note that our optimization performs end-to-end estimation of the final results and we do not use the common two-stage blind deconvolution (kernel estimation, followed by non-blind deconvolution). Additionally, our method uses a noise-adaptive scheme where we do not assume knowledge of the input noise level.\n\n5.3 Super-resolution\n\nTo demonstrate the generality of our prior, we perform an additional test with single-image super-resolution. We evaluate our method on the two common datasets Set5 [3] and Set14 [36] for different upsampling scales. Since these tests do not include degradation noise (σn = 0), we perform our optimization with a rough weight for the prior and decrease it gradually to zero. We compare our method in Table 3. The upper half of the table represents methods that are specifically trained for super-resolution. SRCNN [11] and TNRD [7] have separate models trained for ×2, 3, 4 scales, and we used the model for ×4 to produce the ×5 results. VDSR [16] and DnCNN-3 [39] have a single model trained for ×2, 3, 4 scales, which we also used to produce ×5 results. The lower half of the table represents general priors that are not designed specifically for super-resolution. 
Our method performs on par with state-of-the-art methods across all upsampling scales.\n\nscale:\n\nMethod\nBicubic\nSRCNN [11]\nTNRD [7]\nVDSR [16]\nDnCNN-3 [39]\nDAEP [4]\nIRCNN [40]\nOurs\n\n×2\n31.80\n34.50\n34.62\n34.50\n35.20\n35.23\n35.07\n35.16\n\nSet5 [3]\n×4\n×3\n26.73\n28.67\n28.60\n30.84\n28.83\n31.08\n31.39\n29.19\n29.30\n31.58\n29.01\n31.44\n29.01\n31.26\n31.38\n29.16\n\n×5\n25.32\n26.12\n26.88\n25.91\n26.30\n27.19\n27.13\n27.38\n\n×2\n28.53\n30.52\n30.53\n30.72\n30.99\n31.07\n30.79\n30.99\n\nSet14 [36]\n×4\n×3\n24.44\n25.92\n25.76\n27.48\n25.92\n27.60\n27.81\n26.16\n26.25\n27.93\n27.93\n26.13\n25.96\n27.68\n27.90\n26.22\n\n×5\n23.46\n24.05\n24.61\n24.01\n24.26\n24.88\n24.73\n25.01\n\nTable 3: Average PSNR (dB) for super-resolution on two datasets.\n\nMatlab [21]: 33.9, RTF [15]: 37.8, Gharbi et al. [13]: 38.4, Gharbi et al. [13] f.t.: 38.6, SEM [17]: 38.7, Ours: 38.8\nTable 4: Average PSNR (dB) in linear RGB space for demosaicing on the Panasonic dataset [15].\n\n5.4 Demosaicing\n\nWe finally performed a demosaicing experiment on the dataset introduced by Khashabi et al. [15]. This dataset is constructed by taking RAW images from a Panasonic camera, where the images are downsampled to construct the ground truth data. Due to the downsampling effect, in this evaluation we train a DAE with noise standard deviation σ₁ = 3. The test dataset consists of 100 noisy images captured by a Panasonic camera using a Bayer color filter array (RGGB). We initialize our method with Matlab's demosaic function [21]. To get an even better initialization, we perform our initial optimization with a large degradation noise estimate (σn = 2.5) and then perform the optimization with a lower estimate (σn = 1). We summarize the quantitative results in Table 4. 
Our method\nis again on par with the state of the art. Additionally, our prior is not trained for a speci\ufb01c color\n\ufb01lter array and therefore is not limited to a speci\ufb01c sub-pixel order. Figure 4 shows a qualitative\ncomparison, where our method produces much smoother results compared to the previous state of the\nart.\n\nGround Truth\n\nRTF [15]\n\nGharbi et al. [13]\n\nSEM [17]\n\nOurs\n\nFigure 4: Visual comparison for demosaicing noisy images from the Panasonic data set [15].\n\n6 Conclusions\n\nWe proposed a Bayesian deep learning framework for image restoration with a generic image prior\nthat directly represents the Gaussian smoothed natural image probability distribution. We showed that\nwe can compute the gradient of our prior ef\ufb01ciently using a trained denoising autoencoder (DAE).\nOur formulation allows us to learn a single prior and use it for many image restoration tasks, such as\nnoise-blind deblurring, super-resolution, and image demosaicing. Our results indicate that we achieve\nperformance that is competitive with the state of the art for these applications. In the future, we would\nlike to explore generalizing from Gaussian smoothing of the underlying distribution to other types of\nkernels. We are also considering multi-scale optimization where one would reduce the Bayes utility\nsupport gradually to get a tighter bound with respect to maximum a posteriori. Finally, our approach\nis not limited to image restoration and could be exploited to address other inverse problems.\n\n8\n\n\fAcknowledgments. MJ and PF acknowledge support from the Swiss National Science Foundation\n(SNSF) on project 200021-153324.\n\nReferences\n[1] Guillaume Alain and Yoshua Bengio. What regularized auto-encoders learn from the data-generating\n\ndistribution. Journal of Machine Learning Research, 15:3743\u20133773, 2014.\n\n[2] Pablo Arbelaez, Michael Maire, Charless Fowlkes, and Jitendra Malik. Contour detection and hierarchical\nimage segmentation. 
IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(5):898\u2013916,\n2011.\n\n[3] Marco Bevilacqua, Aline Roumy, Christine Guillemot, and Marie-Line Alberi-Morel. Low-complexity\nsingle-image super-resolution based on nonnegative neighbor embedding. In British Machine Vision\nConference, BMVC 2012, Surrey, UK, September 3-7, 2012, pages 1\u201310, 2012.\n\n[4] Siavash Arjomand Bigdeli and Matthias Zwicker. Image restoration using autoencoding priors. arXiv\n\npreprint arXiv:1703.09964, 2017.\n\n[5] Antoni Buades, Bartomeu Coll, and J-M Morel. A non-local algorithm for image denoising. In Computer\nVision and Pattern Recognition (CVPR), 2005 IEEE Conference on, volume 2, pages 60\u201365. IEEE, 2005.\n\n[6] JH Chang, Chun-Liang Li, Barnabas Poczos, BVK Kumar, and Aswin C Sankaranarayanan. One net-\nwork to solve them all\u2014solving linear inverse problems using deep projection models. arXiv preprint\narXiv:1703.09912, 2017.\n\n[7] Yunjin Chen and Thomas Pock. Trainable nonlinear reaction diffusion: A \ufb02exible framework for fast and\neffective image restoration. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(6):1256\u2013\n1272, 2017.\n\n[8] Dorin Comaniciu and Peter Meer. Mean shift: A robust approach toward feature space analysis. IEEE\n\nTransactions on Pattern Analysis and Machine Intelligence, 24(5):603\u2013619, 2002.\n\n[9] Kostadin Dabov, Alessandro Foi, Vladimir Katkovnik, and Karen Egiazarian. Image denoising with\nblock-matching and 3d \ufb01ltering. In Electronic Imaging 2006, pages 606414\u2013606414. International Society\nfor Optics and Photonics, 2006.\n\n[10] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical\nimage database. In Computer Vision and Pattern Recognition (CVPR), 2009 IEEE Conference on, pages\n248\u2013255. 
IEEE, 2009.

[11] Chao Dong, Chen Change Loy, Kaiming He, and Xiaoou Tang. Image super-resolution using deep convolutional networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(2):295–307, 2016.

[12] Rob Fergus, Barun Singh, Aaron Hertzmann, Sam T Roweis, and William T Freeman. Removing camera shake from a single photograph. In ACM Transactions on Graphics (TOG), volume 25, pages 787–794. ACM, 2006.

[13] Michaël Gharbi, Gaurav Chaurasia, Sylvain Paris, and Frédo Durand. Deep joint demosaicking and denoising. ACM Transactions on Graphics (TOG), 35(6):191, 2016.

[14] M. Jin, S. Roth, and P. Favaro. Noise-blind image deblurring. In Computer Vision and Pattern Recognition (CVPR), 2017 IEEE Conference on. IEEE, 2017.

[15] Daniel Khashabi, Sebastian Nowozin, Jeremy Jancsary, and Andrew W Fitzgibbon. Joint demosaicing and denoising via learned nonparametric random fields. IEEE Transactions on Image Processing, 23(12):4968–4981, 2014.

[16] Jiwon Kim, Jung Kwon Lee, and Kyoung Mu Lee. Accurate image super-resolution using very deep convolutional networks. In Computer Vision and Pattern Recognition (CVPR), 2016 IEEE Conference on, pages 1646–1654. IEEE, 2016.

[17] Teresa Klatzer, Kerstin Hammernik, Patrick Knobelreiter, and Thomas Pock. Learning joint demosaicing and denoising based on sequential energy minimization. In Computational Photography (ICCP), 2016 IEEE International Conference on, pages 1–11. IEEE, 2016.

[18] Dilip Krishnan and Rob Fergus. Fast image deconvolution using hyper-Laplacian priors. In Advances in Neural Information Processing Systems, pages 1033–1041, 2009.

[19] Anat Levin, Rob Fergus, Frédo Durand, and William T Freeman. Image and depth from a conventional camera with a coded aperture. ACM Transactions on Graphics (TOG), 26(3):70, 2007.

[20] Anat Levin and Boaz Nadler.
Natural image denoising: Optimality and inherent bounds. In Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, pages 2833–2840. IEEE, 2011.

[21] Henrique S Malvar, Li-wei He, and Ross Cutler. High-quality linear interpolation for demosaicing of Bayer-patterned color images. In Acoustics, Speech, and Signal Processing, 2004. Proceedings (ICASSP '04). IEEE International Conference on, volume 3, pages iii–485. IEEE, 2004.

[22] Tim Meinhardt, Michael Möller, Caner Hazirbas, and Daniel Cremers. Learning proximal operators: Using denoising networks for regularizing inverse imaging problems. arXiv preprint arXiv:1704.03488, 2017.

[23] Koichi Miyasawa. An empirical Bayes estimator of the mean of a normal population. Bull. Inst. Internat. Statist, 38(181-188):1–2, 1961.

[24] Daniele Perrone and Paolo Favaro. A logarithmic image prior for blind deconvolution. International Journal of Computer Vision, 117(2):159–172, 2016.

[25] J. Portilla, V. Strela, M. J. Wainwright, and E. P. Simoncelli. Image denoising using scale mixtures of Gaussians in the wavelet domain. IEEE Transactions on Image Processing, 12(11):1338–1351, Nov 2003.

[26] M Raphan and E P Simoncelli. Least squares estimation without priors or supervision. Neural Computation, 23(2):374–420, Feb 2011. Published online, Nov 2010.

[27] Yaniv Romano, Michael Elad, and Peyman Milanfar. The little engine that could: Regularization by denoising (RED). arXiv preprint arXiv:1611.02862, 2016.

[28] Stefan Roth and Michael J Black. Fields of experts: A framework for learning image priors. In Computer Vision and Pattern Recognition (CVPR), 2005 IEEE Conference on, volume 2, pages 860–867. IEEE, 2005.

[29] Leonid I. Rudin, Stanley Osher, and Emad Fatemi.
Nonlinear total variation based noise removal algorithms. Physica D: Nonlinear Phenomena, 60(1):259–268, 1992.

[30] Uwe Schmidt, Jeremy Jancsary, Sebastian Nowozin, Stefan Roth, and Carsten Rother. Cascades of regression tree fields for image restoration. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(4):677–689, 2016.

[31] Uwe Schmidt and Stefan Roth. Shrinkage fields for effective image restoration. In Computer Vision and Pattern Recognition (CVPR), 2014 IEEE Conference on, pages 2774–2781. IEEE, 2014.

[32] Tamar Rott Shaham and Tomer Michaeli. Visualizing image priors. In European Conference on Computer Vision, pages 136–153. Springer, 2016.

[33] Singanallur V Venkatakrishnan, Charles A Bouman, and Brendt Wohlberg. Plug-and-play priors for model based reconstruction. In GlobalSIP, pages 945–948. IEEE, 2013.

[34] Pascal Vincent, Hugo Larochelle, Yoshua Bengio, and Pierre-Antoine Manzagol. Extracting and composing robust features with denoising autoencoders. In Proceedings of the 25th International Conference on Machine Learning, pages 1096–1103. ACM, 2008.

[35] Lei Xiao, Felix Heide, Wolfgang Heidrich, Bernhard Schölkopf, and Michael Hirsch. Discriminative transfer learning for general image restoration. arXiv preprint arXiv:1703.09245, 2017.

[36] Roman Zeyde, Michael Elad, and Matan Protter. On single image scale-up using sparse-representations. In International Conference on Curves and Surfaces, pages 711–730. Springer, 2010.

[37] Haichao Zhang and David Wipf. Non-uniform camera shake removal using a spatially-adaptive sparse penalty. In Advances in Neural Information Processing Systems, pages 1556–1564, 2013.

[38] Haichao Zhang and Jianchao Yang. Scale adaptive blind deblurring. In Advances in Neural Information Processing Systems, pages 3005–3013, 2014.

[39] Kai Zhang, Wangmeng Zuo, Yunjin Chen, Deyu Meng, and Lei Zhang.
Beyond a Gaussian denoiser: Residual learning of deep CNN for image denoising. arXiv preprint arXiv:1608.03981, 2016.

[40] Kai Zhang, Wangmeng Zuo, Shuhang Gu, and Lei Zhang. Learning deep CNN denoiser prior for image restoration. arXiv preprint arXiv:1704.03264, 2017.

[41] Daniel Zoran and Yair Weiss. From learning models of natural image patches to whole image restoration. In Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, pages 479–486. IEEE, 2011.