About Ugako and How it Works
Ugako is a public domain image search. It lets you search for and find images, with a focus on images usable in internet articles, and it has unique image editing functionality. At the time of writing, over 250,000 images are in Ugako's search index.
Ugako is a useful tool, but it's also a showcase and a way for the author to experiment with technologies. Whilst it works just fine with ordinary keyword queries, because of the way it's built it works much better with phrases and with titles of, for example, articles.
So how does it work?
The first step is, of course, images. A number of web crawlers, written in Python, look for images on the web marked as being in the public domain.
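The crawlers themselves aren't published, but a minimal sketch of the idea, scanning a page's HTML for images and a public domain marker, might look like this (the page content and the licence check here are made up for illustration; a real crawler would fetch pages over the network and check licences far more carefully):

```python
from html.parser import HTMLParser

class ImageFinder(HTMLParser):
    """Collects <img> sources from a page, and notes whether the page
    declares a public domain licence (here: a link to the CC0 deed)."""

    def __init__(self):
        super().__init__()
        self.images = []
        self.public_domain = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "img" and "src" in attrs:
            self.images.append(attrs["src"])
        # A crude licence check: a link to the CC0 public domain deed.
        if tag == "a" and "creativecommons.org/publicdomain" in attrs.get("href", ""):
            self.public_domain = True

page = """
<html><body>
  <a href="https://creativecommons.org/publicdomain/zero/1.0/">CC0</a>
  <img src="flower.jpg"><img src="cat.jpg">
</body></html>
"""

finder = ImageFinder()
finder.feed(page)
if finder.public_domain:
    print(finder.images)  # only keep images from pages marked public domain
```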
Once an image is found, we need to add it to some sort of database so it's easy to look up. This isn't a traditional database; it's one optimised for fast lookup, which we call an index. Here we come across our first problem: what exactly do we store in the index?
We're matching text to images, so at its simplest we want to store those connections in the database. Say the picture is of a plant; then it needs to be stored as 'plant'. But what if it's a flower? Then it needs to be stored as 'flower', but it's also a 'plant'. If somebody searched for 'plant', we'd want the results to show plants first, but move on to flowers when it runs out of plants. What about the colours? If somebody searches for 'red flower' or 'yellow flower', we don't want to simply show any flowers, but ones of that specific colour. So we must index that too.
There are a lot of possible things to look out for and put into this index for each image; cataloguing them all by hand would produce an impossibly large index. Thankfully, we can use AI for this. Even better, somebody has already trained one on over 400 million image and text pairs: OpenAI's CLIP. So we use that. For each image it gives us a series of numbers, known as a vector, that we can put into the index.
Once we have the vectors in an index, we need a way to query them so that when somebody enters some text, it finds the right images. To do that we need to understand vectors. Let's say we have a one-dimensional chart, otherwise known as a line. At one end of the chart is "not dog" and at the other end is "definitely dog".
When somebody queries "dog", we take that text to mean the "definitely dog" end of the chart. If we plot every image on this chart from "not dog" to "definitely dog", we clearly want to return to the user the ones closest to where their query sits on the line. That's "definitely dog", so it returns dog pictures. Handy!
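In code, that one-dimensional picture is just sorting by distance to the query's position on the line. A toy sketch (the positions are invented for illustration):

```python
# Each image gets a made-up position on the line from
# "not dog" (0.0) to "definitely dog" (1.0).
images = {"dog_photo": 0.95, "wolf": 0.70, "cat": 0.10, "teapot": 0.02}

query = 1.0  # the text "dog" means the "definitely dog" end of the line

# Rank images by how close they sit to the query on the line.
ranked = sorted(images, key=lambda name: abs(images[name] - query))
print(ranked)  # dog-like images come first
```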
Now let's consider two things at once. Say we have a graph where one axis runs from "not dog" to "definitely dog" and another runs from "not cat" to "definitely cat". If we plot all our images onto this graph, they will be scattered around; some dogs look a bit like cats too. The same principle applies, though: we simply find the images on the graph closest to "definitely dog", and it now returns first the images that are most likely dogs and least likely cats.
As you can imagine, this gets hard for humans to visualise fast. We could add a third axis for rabbits, then blow our minds by adding another one for horses, a fifth for sheep, and so on. One of the differences between a computer and the human brain is that the computer doesn't see this as a problem. To a computer it's just the same thing: calculate which images would be closest if you made a graph with each of these things, and return the closest ones.
It's important to say here that we don't know exactly what each item in the vector from OpenAI's CLIP stands for, because it's an AI. Almost certainly there isn't a number in the vector which represents 'dog' exactly, or 'cat'. But that doesn't matter: as long as the AI has learned these things, they're in there somewhere, we just don't know where. The algorithm works the same either way. We call this algorithm nearest neighbour search.
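The many-dimensional version is the same sort as before, just with a distance function that works over whole vectors. A toy sketch (three dimensions and hand-picked numbers; real CLIP vectors have hundreds of dimensions whose meanings we don't know):

```python
import math

def distance(a, b):
    """Euclidean distance between two vectors of the same length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Toy 3-dimensional image vectors standing in for CLIP output.
index = {
    "dog_photo": [0.9, 0.1, 0.0],
    "cat_photo": [0.1, 0.9, 0.0],
    "rabbit":    [0.1, 0.1, 0.9],
}

query = [1.0, 0.0, 0.0]  # a pretend vector for the text "dog"

# Nearest neighbour search: sort every image by distance to the query.
nearest = sorted(index, key=lambda name: distance(index[name], query))
print(nearest[0])
```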
A Problem with Time
Imagine now that we've converted the user's search query to a vector. As I just explained, we simply find the images that have the closest vectors. Another way of putting that is that for every image we need to calculate how close it is to the query vector.
If someone asked you to peel 10 potatoes, you could do it. If someone asked you to peel 250,000 potatoes, you'd rightly look at them like they're mad. Computers have the same problem. Even though they can do things very, very fast, the time taken still depends on how many times the task is performed. If there were 1,000 images in the index, it might be reasonable to find the closest images for each query by testing all of them. But at 250,000, that takes too long.
Some potatoes have a strange shape that makes them harder to peel. If someone asked you to peel lots, you could peel the easy ones and discard the difficult ones. That would save you time. The dirty secret of most search algorithms of this kind is that we can do pretty much the same thing. We call this approximate nearest neighbour search. Rather than search absolutely every image, say for 'cat', we assume that giving you a good image of a cat is enough, rather than the absolute most cat-like image we have. That lets us calculate the distance not necessarily for every image, but for the most likely candidates first, and we arrange the index in a different way, optimised for finding those candidates. Ugako uses the ANNOY approximate nearest neighbour algorithm.
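ANNOY itself builds a forest of random-projection trees; a heavily simplified sketch of the core trick, splitting the points with one random hyperplane and only searching the query's side, could look like this (toy data and a single split, not the real library):

```python
import random

random.seed(42)  # fixed seed so the example is reproducible

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def sqdist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

# Toy 2-dimensional image vectors.
points = [(0.9, 0.1), (0.8, 0.2), (0.1, 0.9), (0.2, 0.8)]

# Split the points with one random hyperplane through the origin.
# ANNOY builds whole trees of such splits, and a forest of such trees.
normal = (random.random() - 0.5, random.random() - 0.5)
left = [p for p in points if dot(p, normal) < 0]
right = [p for p in points if dot(p, normal) >= 0]

def approx_nearest(query):
    # Only search the side of the hyperplane the query falls on:
    # roughly half the work, at the risk of missing the true nearest.
    side = left if dot(query, normal) < 0 else right
    return min(side, key=lambda p: sqdist(p, query))

print(approx_nearest((1.0, 0.0)))
```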
After all of that, we have a list of images that are closest to the search query (approximately, of course!). So we display them, right? Nearly!
It turns out that the internet is full of images that not everybody should see, and in some cases some that arguably nobody would want to see. Public domain images quite often turn out to be adult in nature. We'll want to filter those out.
To do that, Ugako uses a binary classifier neural network. "Classifier" because it classifies! Or, in other words, it says whether something is in a group or not. "Binary" means it has two states; in other words, this AI only gives two answers: either this image should be filtered or it should not. By running the results through this AI before displaying them, we can remove the ones marked to be filtered.
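Conceptually, a binary classifier is just a function from an image to one of two answers. A toy stand-in (the scores and threshold here are invented; the real network computes its score from the image's pixels):

```python
def should_filter(image_score, threshold=0.5):
    """A binary classifier reduced to its essence: a score in,
    one of exactly two answers out (filter it, or don't)."""
    return image_score >= threshold

# Pretend scores for two search results.
results = {"sunset.jpg": 0.02, "questionable.jpg": 0.97}

# Drop anything the classifier says to filter before display.
safe = [name for name, score in results.items() if not should_filter(score)]
print(safe)
```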
Next, we need to consider that we have our results in order, with the closest to, for example, 'cat' first. The first image is the cattiest of cats, according to an AI. That's a slightly different question to what the user making a query is looking for. An image can be very recognisably a cat but simultaneously be visually unappealing to humans. It could have bad lighting, bad colours, bad composition, and so on.
So for the next task we want to adjust this ordering so that not only does the cattiest of cats come first, but there is also a tendency for visually appealing images to come earlier in the ranking than they otherwise would. Ugako uses a custom-trained neural network to give each image a score and then adjusts the ranking accordingly.
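The exact weighting Ugako uses isn't described here, but a hypothetical sketch of the idea, nudging the similarity ranking with an appeal score rather than replacing it, might be:

```python
# (similarity to the query, appeal score from the aesthetics network),
# both made up for illustration.
candidates = {
    "blurry_cat.jpg": (0.95, 0.20),
    "lovely_cat.jpg": (0.93, 0.90),
    "dog.jpg":        (0.40, 0.95),
}

def combined(name, appeal_weight=0.1):
    similarity, appeal = candidates[name]
    # Similarity still dominates; appeal only nudges the order.
    return similarity + appeal_weight * appeal

ranked = sorted(candidates, key=combined, reverse=True)
print(ranked)
```

With these numbers, the appealing cat overtakes the slightly more cat-like but blurry one, while the dog stays last: a nudge, not a takeover.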
Finally we can display the results.
Behind the Scenes
It's not quite as simple as that in practice, though. Remember the index? It's nearly 1GB in size. Imagine that every time you made a query, a computer had to load a 1GB file to do the search. You could have the fastest search possible, but loading that file is going to take a long time.
To get around this, we want the web server to load the index into the much faster RAM once, and keep it there, rather than reloading it for every query. Here we're lucky: Python tends to be used for AI, we do a lot of AI, and there are tools for Python that let us build a web server that does exactly that. We use Gunicorn. To handle the more complex parts of serving the web (such as SSL encryption), we put Nginx in front of it.

This means the server must have enough RAM to hold all of this, which leads us to another problem: memory page swapping. Computers, by design, will swap unused bits of memory to the disk. The logic is that this keeps the faster RAM free for things being used and run more often. But that's not what we want: if the index is swapped to disk, a search query effectively has to load the file all over again. The server is therefore configured to swap memory to disk as little as possible.
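The exact configuration isn't published, but a setup along these lines (the module name, worker count, and port are assumptions for illustration) keeps one copy of the app and its index in RAM and discourages swapping:

```shell
# Load the app (and its ~1GB index) once in the master process, then
# fork workers that share that memory, instead of loading per request.
gunicorn --preload --workers 4 --bind 127.0.0.1:8000 app:application

# Tell the kernel to avoid swapping RAM out to disk where possible.
sysctl vm.swappiness=1
```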
Through a perfect combination of AI, technologies, and special configuration, we've guided the searcher to their perfect image and they've clicked on it. Or, technically, I suppose we must now admit: the approximately perfect image! But there are likely other problems with the image. It may be the wrong size or aspect ratio, the colours might not pop, or something else.
It'd be great if the searcher could modify the image a bit. So Ugako lets them. Image editing and manipulation is something a computer's CPU can do, but GPUs are much better and faster at it, and they allow a greater variety of editing tools. For example, Ugako goes beyond the usual things like contrast and saturation, as far as letting users apply LUTs to the images. It could do this using GPUs on the servers, or specialised hosted GPUs. But that is extraordinarily expensive, especially when you consider that, with few (if any) exceptions, there's a perfectly good GPU in the user's own computer! So Ugako does the editing there instead.
And that's all there is to it. Simples :)