Stylization of images using neural networks: no mysticism, just math. Ostagram: a neural network service that combines photos and ornaments into artistic masterpieces. The neural network as artist


Since August 2015, when German researchers from the University of Tübingen presented their method for transferring the style of famous artists onto other photos, services monetizing this capability have begun to appear. One such service was launched on the Western market, and a full copy of it on the Russian one.


Although Ostagram launched back in December, it only began to rapidly gain popularity on social networks in mid-April. At the same time, the project's VKontakte group had fewer than a thousand members as of April 19.

To use the service, you need to prepare two images: the photo to be processed and a picture with an example of the style to be overlaid onto the original shot.

The service has a free version: it creates an image at a minimal resolution of up to 600 pixels along the longest side of the picture. The user receives the result of only one iteration of the filter applied to the photo.

There are two paid versions: Premium produces a picture up to 700 pixels along the longest side and applies 600 iterations of neural network processing to the image (the more iterations, the more interesting and intensive the processing). One such image costs 50 rubles.

In the HD version you can configure the number of iterations: 100 cost 50 rubles, and 1000 cost 250 rubles. In this case the image will have a resolution of up to 1200 pixels along the longest side, and it can be used for printing on canvas: Ostagram offers such a service with delivery starting from 1800 rubles.

In February, Ostagram representatives announced that they would not accept image processing requests from users "from countries with developed capitalism", but then restored access to photo processing for VKontakte users from all over the world. Judging by the Ostagram code published on GitHub, it was developed by Sergei Morugin, a 30-year-old resident of Nizhny Novgorod.

TJ contacted the project's commercial director, who introduced himself as Andrey. According to him, Ostagram appeared before Instapainting, but he was inspired by a similar project called Vipart.

Ostagram was developed by a group of students from NSTU (the Alekseev Nizhny Novgorod State Technical University). After initial testing on a narrow circle of friends at the end of 2015, they decided to make the project public. Initially, image processing was completely free, and the plan was to earn money by selling printed paintings. According to Andrey, printing turned out to be the biggest problem: photos processed by a neural network rarely look pleasing to the human eye, and the end client often has to tweak the result for a long time before printing it on canvas, which requires a lot of machine resources.

For image processing, the creators of Ostagram wanted to use Amazon cloud servers, but after the influx of users it became clear that the costs would exceed a thousand dollars a day with a minimal return on investment. Andrey, who is also an investor in the project, rented server capacity in Nizhny Novgorod.

The project's audience is about a thousand people per day, but on some days it reached 40 thousand thanks to traffic from foreign media, which managed to notice the project before the domestic press did (Ostagram even got shared by European DJs). At night, when traffic is low, image processing can take about five minutes; during the day it can take up to an hour.

While access to image processing was earlier deliberately limited for foreign users (so as to start monetization with Russia), Ostagram is now counting on a Western audience.

For now, the prospects of breaking even are tentative. If every user paid 10 rubles for processing, perhaps it would pay off. [...]

It is very hard to monetize in our country: people here are ready to wait a week, but won't pay a penny for it. Europeans are more favorable towards this — towards paying for speed-up and quality improvement — so the focus is on that market.

Andrey, Ostagram representative

According to Andrey, the Ostagram team is working on a new version of the site with a strong social focus: "It will resemble one well-known service, but what can you do." Facebook representatives in Russia have already shown interest in the project, but the talks have not yet reached the point of selling the service.

Examples of the service's work

In the feed on the Ostagram website, you can also see which combinations of images produced the final pictures: this is often even more interesting than the result itself. At the same time, filters — the pictures used as the processing effect — can be saved for further use.

Greetings, Habr! You have surely noticed that the topic of styling photos after various artistic styles is being actively discussed on these internets of yours. Reading all these popular articles, you might think that magic is going on under the hood of these applications, and that the neural network really fantasizes and redraws the image from scratch. It so happened that our team faced a similar task: as part of an internal corporate hackathon we did video stylization, because an app for photos already existed. In this post, we will figure out how the network "redraws" images and go over the articles that made this possible. I recommend reading the previous post before this material and, in general, getting acquainted with the basics of convolutional neural networks. A few formulas, a bit of code (the examples use Theano and Lasagne), and a lot of pictures await you. This post is structured in the chronological order of the articles' appearance and, accordingly, of the ideas themselves. From time to time I will dilute it with our recent experience. Here is a boy from hell to grab your attention.


Visualizing and Understanding Convolutional Networks (28 Nov 2013)

First of all, it is worth mentioning the article in which the authors showed that a neural network is not a black box but quite an interpretable thing (by the way, today this can be said not only about convolutional networks for computer vision). The authors decided to learn how to interpret the activations of the neurons of the hidden layers; for this they used the deconvolutional neural network (deconvnet) proposed several years earlier (by the way, by the same Zeiler and Fergus who authored this publication). A deconvolutional network is in fact the same network with convolutions and poolings, but applied in reverse order. The original work on deconvnet used the network in an unsupervised learning mode to generate images. This time, the authors applied it simply for a reverse pass from the features obtained after a forward pass through the network back to the original image. The result is an image that can be interpreted as the signal that caused this activation of the neurons. Naturally, the question arises: how do we make a reverse pass through a convolution and a nonlinearity? And especially through max pooling, which is certainly not an invertible operation. Let's consider all three components.

Reverse ReLU

In convolutional networks, ReLU(x) = max(0, x) is often used as the activation function, which makes all activations on a layer non-negative. Accordingly, when passing back through the nonlinearity, we also need to obtain non-negative results. For this, the authors propose using the same ReLU. From the point of view of the Theano architecture, this requires overriding the gradient function of the operation (the infinitely valuable notebook is in the Lasagne Recipes; from there you can glean the details of what the ModifiedBackprop class is).

class ZeilerBackprop(ModifiedBackprop):
    def grad(self, inputs, out_grads):
        (inp,) = inputs
        (grd,) = out_grads
        #return (grd * (grd > 0).astype(inp.dtype),)  # explicitly rectify
        return (self.nonlinearity(grd),)  # use the given nonlinearity

Reverse convolution

Here it is a bit more complicated, but everything is logical: it is enough to apply the transposed version of the same convolution kernel, but to the outputs of the reverse ReLU instead of the previous layer's outputs used on the forward pass. I am afraid this is not so obvious in words, so let's look at a visualization of this procedure (even more convolution visualizations are out there).


Convolution with stride = 1

Convolution with stride = 1, reverse version

Convolution with stride = 2

Convolution with stride = 2, reverse version

Reverse pooling

This operation (unlike the previous ones) is, generally speaking, not invertible. But we would still like to pass through the maximum somehow on the reverse pass. For this, the authors propose using a map of where the maximum was on the forward pass (max location switches). On the reverse pass, the input signal is unpooled so as to approximately preserve the structure of the original signal; it really is easier to show than to describe.
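Here is a toy numpy sketch of the idea (my own illustration, not code from the article): on the forward pass we remember which position in each 2x2 block held the maximum, and on the reverse pass we place the signal back only at those positions.

import numpy as np

def max_pool_with_switches(x):
    # x: a 2-D array with even height and width
    H, W = x.shape
    blocks = x.reshape(H // 2, 2, W // 2, 2).transpose(0, 2, 1, 3).reshape(-1, 4)
    switches = blocks.argmax(axis=1)                  # where each maximum came from
    pooled = blocks.max(axis=1).reshape(H // 2, W // 2)
    return pooled, switches

def unpool_with_switches(pooled, switches):
    # put each pooled value back at the position recorded in switches
    H2, W2 = pooled.shape
    blocks = np.zeros((H2 * W2, 4), dtype=pooled.dtype)
    blocks[np.arange(H2 * W2), switches] = pooled.ravel()
    return blocks.reshape(H2, W2, 2, 2).transpose(0, 2, 1, 3).reshape(H2 * 2, W2 * 2)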



Result

The visualization algorithm itself is extremely simple (a rough code sketch follows the list):

  1. Make a forward pass.
  2. Select the layer we are interested in.
  3. Fix the activations of one or several neurons and zero out the rest.
  4. Make a reverse pass.
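A minimal sketch of these four steps in Theano/Lasagne, assuming net is a dictionary of layers (as in the VGG examples) whose ReLU ops have already been patched with ZeilerBackprop; for simplicity it fixes a whole feature map rather than a single neuron:

import theano
import lasagne

def compile_deconv_visualization(net, layer_name, feature_map_index):
    inp = net['input'].input_var                          # 1. the forward pass starts here
    feat = lasagne.layers.get_output(net[layer_name],     # 2. the layer of interest
                                     deterministic=True)
    activation = feat[:, feature_map_index].sum()         # 3. keep one feature map, drop the rest
    recon = theano.grad(activation, wrt=inp)              # 4. the reverse pass is the gradient
    return theano.function([inp], recon)                  #    with respect to the input image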

Each gray square in the image below corresponds to the visualization of a filter (which is used for the convolution) or of the weights of one neuron, and each color picture is the part of the original image that activates the corresponding neuron. For clarity, the neurons within one layer are grouped into thematic groups. In general, it suddenly turned out that the neural network learns exactly what Hubel and Wiesel wrote about in their work on the structure of the visual system, for which they were awarded the Nobel Prize in 1981. Thanks to this article, we got a visual representation of what a convolutional neural network learns at each layer. It is this knowledge that will later make it possible to manipulate the contents of the generated image, but that was still a few years away; the intervening years went into improving the methods of "trepanning" neural networks. In addition, the authors of the article proposed a way to analyze how best to build the architecture of a convolutional neural network to achieve better results (although they did not win ImageNet 2013, they made it into the top; upd.: it turns out they did win — Clarifai is them).


Feature visualization


Here is an example of visualizing activations using a deconvnet; today this result looks so-so, but back then it was a breakthrough.


Saliency Maps using Deconvnet

Deep Inside Convolutional Networks: Visualising Image Classification Models and Saliency Maps (19 Apr 2014)

This article is devoted to methods for visualizing the knowledge contained in a convolutional neural network. The authors propose two visualization methods based on gradient descent.

Class Model Visualization

So, imagine that we have a neural network trained to solve a classification task for some number of classes. Denote the value of the activation of the output neuron corresponding to class c. Then the following optimization problem gives us exactly the image that maximizes the chosen class:



This problem is easy to solve using Theano. Usually we ask the framework to take the derivative with respect to the model parameters, but this time we assume that the parameters are fixed and take the derivative with respect to the input image. The following function selects the maximum value of the output layer and returns a function that computes the derivative with respect to the input image.


import theano
import theano.tensor as T
import lasagne

def compile_saliency_function(net):
    """Compiles a function to compute the saliency maps and predicted classes
    for a given minibatch of input images."""
    inp = net['input'].input_var
    outp = lasagne.layers.get_output(net['fc8'], deterministic=True)
    max_outp = T.max(outp, axis=1)
    saliency = theano.grad(max_outp.sum(), wrt=inp)
    max_class = T.argmax(outp, axis=1)
    return theano.function([inp], [saliency, max_class])

You have probably seen strange images with dog faces on the internet — DeepDream. In the original article, the authors use the following process (sketched in code right after the list) to generate images that maximize the chosen class:

  1. Initialize the starting image with zeros.
  2. Compute the value of the derivative with respect to this image.
  3. Update the image by adding the image obtained from the derivative.
  4. Return to step 2 or exit the loop.
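A minimal sketch of this loop (not the paper's exact procedure), assuming grad_fn is a compiled Theano function that returns the derivative of the chosen class score with respect to the input image (it can be built along the lines of compile_saliency_function above), and step is a hand-picked learning rate:

import numpy as np

def maximize_class(grad_fn, shape=(1, 3, 224, 224), n_steps=200, step=1.0):
    img = np.zeros(shape, dtype=np.float32)              # 1. start from zeros
    for _ in range(n_steps):
        g = np.asarray(grad_fn(img), dtype=np.float32)   # 2. derivative w.r.t. the image
        img += step * g                                  # 3. gradient ascent step
    return img                                           # 4. exit the loop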

The resulting images look something like this:




And what if we initialize the first image with a real photo and run the same process? On each iteration we will choose a random class, zero out the rest, and compute the value of the derivative — then we get something like Deep Dream.


Caution 60 MB


Why so many dog faces and eyes? It's simple: of the 1000 ImageNet classes almost 200 are dogs, and dogs have eyes. There are also many classes that simply contain people.

Class Saliency Extraction

If we initialize this process with a real photo, stop after the first iteration, and plot the value of the derivative, we get an image which, when added to the original, increases the value of the activation of the chosen class.


Saliency Maps using derivative


Again the result is "so-so". It is important to note that this is a new way of visualizing activations (nothing prevents us from fixing the activation values not on the last layer but on any layer of the network and taking the derivative with respect to the input image). The next article combines both previous approaches and gives us a tool for tuning style transfer, which will be described later.

Striving for Simplicity: The All Convolutional Net (13 Apr 2015)

This article is, generally speaking, not about visualization, but about the fact that replacing pooling with a convolution with a larger stride does not lead to a loss of quality. As a by-product of their research, however, the authors proposed a new way of visualizing features, which they applied to a more precise analysis of what the model learns. Their idea is as follows: if we simply take the derivative, then the features that were negative at the input of the layer (the ones zeroed out by the ReLU on the forward pass) do not pass back, but negative gradient values still do, and this leads to negative values appearing in the back-propagated image. On the other hand, if you use a deconvnet, another ReLU is applied to the derivative of the ReLU — this prevents negative values from passing back, but as you saw, the result is "so-so". What if we combine the two methods?




class GuidedBackprop(ModifiedBackprop):
    def grad(self, inputs, out_grads):
        (inp,) = inputs
        (grd,) = out_grads
        dtype = inp.dtype
        return (grd * (inp > 0).astype(dtype) * (grd > 0).astype(dtype),)

Then we get a completely clean and interpretable image.


Saliency Maps using Guided Backpropagation

Go Deeper.

Now let's think about what this gives us. Let me remind you that each convolutional layer is a function that takes a three-dimensional tensor as input and also produces a three-dimensional tensor as output, possibly of a different dimensionality d x w x h; depth is the number of neurons in the layer, and each of them generates a feature map of size width x height.


Let's try the following experiment on the VGG-19 network:



conv1_2

Yes, you can see almost nothing, because the receptive field is very small: this is the second 3x3 convolution, so the total receptive field is 5x5. But zooming in, we can see that the feature is just a gradient detector.




conv3_3


conv4_3


conv5_3


pool5


And now let's imagine that instead of the maximum over a feature map we take the derivative of the sum of all the elements of the feature map with respect to the input image. Then obviously the receptive field of the group of neurons covers the whole input image. For the early layers we see bright maps, from which we conclude that these are color detectors, then gradients, then edges, and so on towards more and more complex patterns. The deeper the layer, the dimmer the resulting image. This is explained by the fact that deeper layers detect more complex patterns, and a complex pattern occurs less often than a simple one, so the activation map fades. The first method is better suited for understanding layers with complex patterns, and the second for simple ones.


conv1_1


conv2_2


conv4_3


A more complete database of activations for several images is also available for download.

A Neural Algorithm of Artistic Style (2 Sep 2015)

So, a couple of years have passed since the first successful "trepanation" of a neural network. We (in the sense of humanity) have a powerful tool in our hands that allows us to understand what a neural network learns, and also to remove what we would rather it did not learn. The authors of this article develop a method that makes one image generate activation maps similar to those of some target image — and possibly not even just one — and this is the basis of stylization. We feed white noise to the input and, by an iterative process similar to the one in Deep Dream, bring this image to one whose feature maps are similar to those of the target image.

Content Loss.

As already mentioned, each layer of the neural network produces a three-dimensional tensor of some dimension.




Denote the output of the i-th layer for a given input in some way. Then, if we minimize the weighted sum of the differences between these features for the input image and for the content image c that we are striving towards, we get exactly what we need. Probably.
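Written out in the same Theano/Lasagne setting as the snippets above, this term is just a squared difference of feature tensors. A minimal sketch, assuming photo_features and gen_features are dictionaries mapping layer names to the symbolic outputs of lasagne.layers.get_output for the content photo and for the generated image (the 1/2 factor follows the paper):

def content_loss(P, X, layer):
    # P: features of the content photo, X: features of the generated image
    p = P[layer]
    x = X[layer]
    return 1. / 2 * ((x - p) ** 2).sum()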



For experiments with this article you can use this magic notebook, where the computations take place (both on the GPU and on the CPU). The GPU is used to compute the features of the neural network and the value of the cost function. Theano gives us a function eval_grad that computes the gradient of the objective function with respect to the input image x. All of this is then fed into L-BFGS, and the iterative process is launched.


import numpy as np
import scipy.optimize
from lasagne.utils import floatX

# Initialize with a noise image
generated_image.set_value(floatX(np.random.uniform(-128, 128, (1, 3, IMAGE_W, IMAGE_W))))

x0 = generated_image.get_value().astype('float64')
xs = []
xs.append(x0)

# Optimize, saving the result periodically
for i in range(8):
    print(i)
    scipy.optimize.fmin_l_bfgs_b(eval_loss, x0.flatten(), fprime=eval_grad, maxfun=40)
    x0 = generated_image.get_value().astype('float64')
    xs.append(x0)

If we run the optimization of such a function, we quickly get an image similar to the target one. Now we can recreate from white noise images that look like some content image.


Content Loss: conv4_2



The optimization process




It is easy to notice two features of the resulting image:

  • the colors were lost — this is the result of the fact that in this particular example only the conv4_2 layer was used (in other words, its weight w was non-zero, and the weights of the remaining layers were zero); as you remember, it is the early layers that contain information about colors and gradient transitions, while the later ones contain information about larger details, which is what we observe — the colors are lost, but the content is not;
  • some houses "floated", i.e. straight lines became slightly curved — this is because the deeper the layer, the less information about the spatial position of a feature it contains (an effect of applying convolutions and poolings).

Adding the early layers immediately fixes the situation with the colors.


Content Loss: conv1_1, conv2_1, conv4_2


I hope that by this point you feel that you can control what gets redrawn onto an image of white noise.

Style Loss.

And so we got to the most interesting part: how do we transfer the style? What is style? Obviously, style is not what we optimized in the content loss, because that contains a lot of information about the spatial positions of the features. So the first thing we need to do is somehow remove this information from the representations obtained at each layer.


The authors propose the following approach. We take the tensor output by some layer, flatten it along the spatial coordinates, and compute the covariance matrix between the feature maps. Denote this transformation as G. What have we actually done? One could say that we counted how often features within the feature maps co-occur in pairs, or, in other words, that we approximated the distribution of features in the maps with a multivariate normal distribution.
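A sketch of this transformation and the loss built on it, in the same setting as the content_loss sketch above (the Gram matrix and the 1/(4*N^2*M^2) normalization follow the paper; treat the code as an illustration, not as the only possible implementation):

import theano.tensor as T

def gram_matrix(x):
    # flatten the spatial dimensions: (1, channels, H, W) -> (1, channels, H*W)
    x = x.flatten(ndim=3)
    # pairwise dot products between feature maps, i.e. the (uncentered) covariance
    return T.tensordot(x, x, axes=([2], [2]))

def style_loss(A, X, layer):
    a = A[layer]                        # features of the style image
    x = X[layer]                        # features of the generated image
    G_a = gram_matrix(a)
    G_x = gram_matrix(x)
    N = a.shape[1]                      # number of feature maps
    M = a.shape[2] * a.shape[3]         # spatial size of a feature map
    return ((G_x - G_a) ** 2).sum() / (4. * N ** 2 * M ** 2)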




Then the style loss is defined as follows, where s is some image with the desired style:



Shall we try it with Vincent? We get, in principle, something expected — noise in the style of Van Gogh; the information about the spatial arrangement of the features is completely lost.


Vincent




And what if we use a photo as the style image? We get familiar features and familiar colors, but the spatial positions are completely lost.


Photo with Style Loss


Surely you have wondered why we compute the covariance matrix and not something else. After all, there are many ways to aggregate features so that the spatial coordinates are lost. This really is an open question, and if you take something very simple, the result will not change dramatically. Let's check it: we will compute not the covariance matrix, but simply the mean value of each feature map.
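As an illustration of that "very simple" alternative, here is a hypothetical simple_style_loss in the same setting, which compares only the per-feature-map means (spatial information is still discarded):

def simple_style_loss(A, X, layer):
    # mean activation of each feature map; all spatial information is discarded
    mu_a = A[layer].flatten(ndim=3).mean(axis=2)
    mu_x = X[layer].flatten(ndim=3).mean(axis=2)
    return ((mu_x - mu_a) ** 2).sum()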




simple Style Loss

Combined loss

Naturally, the desire arises to mix these two cost functions. Then from white noise we will generate an image that contains features of the content image (which are tied to spatial coordinates) and also "style" features that are not tied to spatial coordinates; i.e. we hope that the details of the content image will stay in their places, but will be redrawn in the desired style.



In fact, there is also a regularizer, but we will omit it for simplicity. It remains to answer the following question: which layers (and with which weights) should be used in the optimization? I am afraid I do not have an answer to this question, and neither do the authors of the article. They suggest using the following, but that does not at all mean that another combination would work worse — the search space is too large. The only rule that follows from an understanding of the model: there is no point in taking neighboring layers, since their features will differ little from each other, which is why one layer from each group, conv*_1, is added to the style.


# Define loss function
losses = []

# content loss
losses.append(0.001 * content_loss(photo_features, gen_features, 'conv4_2'))

# style loss
losses.append(0.2e6 * style_loss(art_features, gen_features, 'conv1_1'))
losses.append(0.2e6 * style_loss(art_features, gen_features, 'conv2_1'))
losses.append(0.2e6 * style_loss(art_features, gen_features, 'conv3_1'))
losses.append(0.2e6 * style_loss(art_features, gen_features, 'conv4_1'))
losses.append(0.2e6 * style_loss(art_features, gen_features, 'conv5_1'))

# total variation penalty
losses.append(0.1e-7 * total_variation_loss(generated_image))

total_loss = sum(losses)

The final model can be represented as follows.




And here is the result: houses with Van Gogh.



An attempt to control the process

Let's remember the previous parts: as early as two years before this article, other scientists were investigating what a neural network really learns. Armed with all these articles, you can visualize the features of various styles and various images at various resolutions and sizes, and try to understand which layers to take with which weights. But even re-weighting the layers does not give full control over what is happening. The problem here is more conceptual: we are optimizing the wrong function! How so, you ask? The answer is simple: this function minimizes a residual... well, you get the idea. What we really want is for us to like the image. A convex combination of the content and style loss functions is not a measure of what our mind considers beautiful. It has been noticed that if the stylization is continued for too long, the cost function naturally keeps falling lower and lower, but the aesthetic beauty of the result drops sharply.




Well, okay, there is one more problem. Suppose we found a layer that extracts the features we need — say, some triangular textures. But this layer still contains many other features, such as circles, which we really do not want to see in the resulting image. Generally speaking, if we could hire a million Chinese workers, we could visualize all the features of the style image and, by brute force, simply mark the ones we need and include only them in the cost function. But for obvious reasons it is not that simple. What if we simply remove from the style image all the circles that we do not want to see in the result? Then the corresponding neurons that react to circles simply will not activate, and, of course, circles will not appear in the resulting picture. It is the same with colors. Imagine a bright image with lots of colors: the distribution of colors will be smeared over the whole space, and the distribution of the resulting image will be the same, but in the process of optimization the peaks that were in the original will probably be lost. It turned out that simply reducing the color palette solves this problem: the density of most colors becomes zero, and large peaks remain in a few areas. Thus, by manipulating the original in Photoshop, we manipulate the features that are extracted from the image. It is easier for a person to express their wishes visually than to try to formulate them in the language of mathematics. For now. As a result, designers and managers armed with Photoshop and feature-visualization scripts achieved results about three times better than what the mathematicians and programmers produced.


An example of manipulating the colors and the size of the features


And you can even take a simple image as the style



Results








And here is the same thing, but only with the desired texture

Texture Networks: Feed-Forward Synthesis of Textures and Stylized Images (10 Mar 2016)

It would seem that we could stop here, if not for one nuance. The stylization algorithm described above takes a very long time. If we take an implementation where L-BFGS runs on the CPU, the process takes about five minutes. If we rewrite it so that the optimization runs on the GPU, the process takes 10-15 seconds. That is still no good. Perhaps the authors of this article and the next one thought about the same thing. Both publications came out independently, 17 days apart, almost a year after the previous article. The authors of this article, like the authors of the previous one, were also working on texture generation (if you simply zero out the content loss, that is roughly what you get). They proposed optimizing not an image obtained from white noise, but a neural network that generates a stylized image.




Now the stylization process itself does not include any optimization: only a forward pass is required. Optimization is needed only once, to train the generator network. This article uses a hierarchical generator, where each next z is larger than the previous one and is sampled from noise in the case of texture generation, and from some image database in the case of stylization. It is critical to use something other than the training part of ImageNet, because the features inside the loss network are computed by a network trained on that very training part.
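A conceptual sketch of the payoff (my illustration, not the authors' code): once a generator has been trained against the perceptual losses, stylizing a new photo is a single compiled forward pass; generator and input_var here are assumed to be the output layer and the input variable of some trained Lasagne generator network.

import theano
import lasagne

def compile_stylizer(generator, input_var):
    # generator: output layer of a trained generator network,
    # input_var: its symbolic input variable
    return theano.function(
        [input_var],
        lasagne.layers.get_output(generator, deterministic=True))

# stylize = compile_stylizer(generator, input_var)
# stylized = stylize(photo)   # one forward pass instead of a per-image L-BFGS loop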



Perceptual Losses for Real-Time Style Transfer and Super-Resolution (27 Mar 2016)

As can be seen from the title, the authors, who were only 17 days late with the idea of a generative network, were also working on increasing image resolution. They were apparently inspired by the successes of residual learning at the latest ImageNet competition.




Accordingly, here are the residual block and the conv block.



Thus, in addition to control over stylization, we now also have a fast generator in our hands (thanks to these two articles, the generation time for a single image is measured in tens of milliseconds).

Conclusion

We used the information from the articles discussed above and the authors' code as a starting point to create another styling app — the first video stylization app:



It generates something like this.


Numerous and not always distinguishable entities appear in the most ordinary photographs. Most often, for some reason, dogs. The Internet began to fill up with such images in June 2015, when DeepDream from Google was launched — one of the first open services based on neural networks and intended for image processing.

It happens roughly like this: the algorithm analyzes the photo, finds fragments that remind it of familiar objects, and distorts the image in accordance with that data.

At first the project was published as open source, and then online services built on the same principles appeared on the Internet. One of the most convenient and popular is Deep Dream Generator: processing a small photo here takes only about 15 seconds (previously users had to wait more than an hour).

How do neural networks learn to create such images? And why, by the way, are they called that?

In their design, neural networks imitate the real neural networks of a living organism, but they do so using mathematical algorithms. Having created a basic structure, you can train it using machine learning methods. If we are talking about image recognition, then thousands of images need to be passed through the neural network. If the neural network's task is different, then the training exercises will be different too.

Algorithms for playing chess, for example, analyze chess games. Along the same lines, the AlphaGo algorithm from Google DeepMind learned the Chinese game of Go — which was perceived as a breakthrough, since Go is much more complex and variable than chess.

    You can play around with a simplified model of a neural network and get a better understanding of its principles.

    YouTube also has a series of accessible hand-drawn videos about how neural networks work.

Another popular service is Dreamscope, which can not only dream up dogs but also imitate various painting styles. Image processing here is also very simple and fast (about 30 seconds).

Apparently, the algorithmic part of the service is a modification of the Neural Style program, which we have already discussed.

Quite recently a program appeared that realistically colorizes black-and-white images. Earlier, similar programs handled this task far less well, and it was considered a great achievement if at least 20% of people could not tell a real picture from an image colored by a computer.

Moreover, coloring here takes only about 1 minute.

The same development company also launched a service that recognizes different types of objects in pictures.

These services may seem like mere entertainment, but in fact everything is much more interesting. New technologies are entering the practice of human artists and changing our ideas about art. It is likely that people will soon have to compete with machines in the field of creativity as well.

Teaching algorithms to recognize images is a task that artificial intelligence developers have been struggling with for a long time. So programs that colorize old pictures and draw dogs in the sky can be considered part of a larger and more intriguing process.