Written By
Dr. Ir. Thierry Deruyttere
Written on 16/08/2023

Matching fashion items in the street to webshop items

Keywords: Computer Vision, Deep Learning

The You & AI team embarked on a project aimed at linking fashion items observed on the streets with corresponding products available in online stores. The central goal of this initiative was to devise a practical tool that could be harnessed by fashion brands. Its purpose? To automate the process of associating street-spotted fashion pieces with their counterparts in the brands' webshops.
Ultimately, the envisioned outcome is to provide users with an effortless means to identify and locate similar items they encounter in everyday scenarios within the digital storefronts of their favorite fashion labels. The tool operates by detecting the various articles of clothing worn by individuals and subsequently cross-referencing these items with the offerings featured in the webshop.
The initial phase involved the training of an object detection model tasked with identifying all fashion items present within an image. This model underwent training utilizing the Street2Shop dataset, a compilation of images depicting individuals in urban settings juxtaposed with corresponding images showcasing items available on web-based retail platforms.
Subsequently, we trained a Convolutional Neural Network (CNN) within a siamese network framework. This architecture was used to match the detected fashion items with their counterparts in the webshop's inventory. A siamese network is a neural network that consists of two identical subnetworks that share the same weights. The model is trained on pairs of images, and the goal is to learn a similarity function that can predict whether two images are similar or dissimilar.
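To make the weight sharing concrete, here is a minimal sketch in plain Python. It is not the project's implementation: the tiny linear projection stands in for the CNN, and the names SHARED_WEIGHTS, embed, siamese_distance and best_match, as well as the toy feature vectors, are hypothetical and chosen only for illustration.

```python
import math

# Hypothetical shared weights: a 2x3 linear projection standing in for the CNN.
SHARED_WEIGHTS = [
    [1.0, 0.0, 0.5],
    [0.0, 1.0, 0.5],
]


def embed(features, weights=SHARED_WEIGHTS):
    """Map a feature vector to an embedding using the shared weights."""
    return [sum(w * x for w, x in zip(row, features)) for row in weights]


def siamese_distance(image_a, image_b):
    """Both inputs go through the same embed function, so the two branches
    of the siamese network share every weight."""
    return math.dist(embed(image_a), embed(image_b))


def best_match(street_item, catalogue):
    """Return the name of the catalogue item closest to the street item."""
    return min(
        catalogue,
        key=lambda name: siamese_distance(street_item, catalogue[name]),
    )


# Toy data (assumed): feature vectors for two webshop items and one street crop.
catalogue = {
    "red_dress": [0.9, 0.1, 0.0],
    "blue_jeans": [0.1, 0.9, 0.0],
}
street_crop = [0.8, 0.2, 0.1]

print(best_match(street_crop, catalogue))
```

In a real system the embeddings of the webshop items would typically be computed once and stored, so that only the detected street item has to be embedded at query time.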
The figure below shows the main idea behind the siamese network. In image (a), the similar object (indicated with a green rectangle) lies a certain margin m away from our anchor image. The goal of the siamese network is to create embeddings such that the distance between the anchor image and the similar image is smaller than the distance between the anchor image and the dissimilar image. This is illustrated in image (b). (Figure source: thesis.)
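The exact loss used in the project is not stated here. A common choice for training on pairs is the contrastive loss, sketched below under that assumption: similar pairs are pulled together, while dissimilar pairs are pushed apart until they are at least a margin m away. The function name and the default margin are illustrative.

```python
import math


def contrastive_loss(emb_a, emb_b, is_similar, margin=1.0):
    """Contrastive loss for one pair of embeddings (a common formulation,
    assumed here; not necessarily the one used in the project)."""
    d = math.dist(emb_a, emb_b)
    if is_similar:
        # Similar pair: any remaining distance is penalised.
        return d ** 2
    # Dissimilar pair: penalised only while closer than the margin.
    return max(0.0, margin - d) ** 2


anchor = [0.0, 0.0]
print(contrastive_loss(anchor, [0.5, 0.0], is_similar=True))
print(contrastive_loss(anchor, [0.5, 0.0], is_similar=False))
print(contrastive_loss(anchor, [3.0, 4.0], is_similar=False))
```

The last call shows that a dissimilar pair that is already further apart than the margin contributes nothing to the loss.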
We also explored other approaches, including a triplet network and even a quadruplet network.
The following figure displays the idea behind the triplet loss. This time, we have three images: the anchor image, the similar image, and the dissimilar image. The goal is to create embeddings such that the distance between the anchor and the dissimilar image is larger than the distance between the anchor and the similar image plus a certain margin m. This is illustrated in images (a) and (b). In image (a), the distance between the anchor and the similar image plus the margin m is larger than the distance between the anchor and the dissimilar image, which results in a non-zero triplet loss. In image (b) we have the opposite situation, so the triplet loss is zero. (Figure source: thesis.)
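The two situations described above can be reproduced with a few lines of Python. This is a sketch of the standard hinge-style triplet loss, max(0, d(anchor, similar) - d(anchor, dissimilar) + m); the function name, the toy 2-D embeddings and the margin value are assumptions made for illustration.

```python
import math


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Hinge-style triplet loss on Euclidean distances."""
    d_pos = math.dist(anchor, positive)
    d_neg = math.dist(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)


anchor = [0.0, 0.0]
similar = [1.0, 0.0]

# Situation (a): the dissimilar image is too close, so the loss is non-zero.
print(triplet_loss(anchor, similar, [1.5, 0.0]))

# Situation (b): the dissimilar image is beyond d(anchor, similar) + margin.
print(triplet_loss(anchor, similar, [3.0, 0.0]))
```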
Finally, we display the idea behind the quadruplet loss. As the name implies, it consists of four samples: x_1, x_1^+, x_2 and x_2^+. The goal is to create embeddings such that the distance between x_1 and x_1^+ and the distance between x_2 and x_2^+ are small. All other distances should be large, so all other embeddings should push each other away. (Figure source: thesis.) The difference between the triplet and quadruplet losses and the resulting inference space is shown in the image below. As seen in the figure, the quadruplet loss results in a larger inter-class distance and a smaller intra-class distance, which is better for generalization. (Figure source: thesis.)
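The text does not give the exact formula for the quadruplet loss, so the sketch below is only one possible way to turn the description into code. It assumes a hinge-style formulation in which each of the two matching pairs must be closer together, by a margin m, than every cross-pair distance. The function name, the margin and the toy embeddings are hypothetical.

```python
import math
from itertools import product


def quadruplet_loss(x1, x1_pos, x2, x2_pos, margin=1.0):
    """Illustrative hinge-style loss over two matching pairs (an assumed
    formulation, not necessarily the one used in the project)."""
    # Distances that should be small: within each matching pair.
    within = [math.dist(x1, x1_pos), math.dist(x2, x2_pos)]
    # Distances that should be large: every combination across the pairs.
    across = [math.dist(u, v) for u, v in product([x1, x1_pos], [x2, x2_pos])]
    return sum(max(0.0, d_in - d_out + margin) for d_in in within for d_out in across)


# Well-separated pairs: every cross distance exceeds within distance + margin.
print(quadruplet_loss([0.0, 0.0], [0.1, 0.0], [5.0, 0.0], [5.1, 0.0]))

# Overlapping pairs: the embeddings still need to be pushed apart.
print(quadruplet_loss([0.0, 0.0], [0.1, 0.0], [0.2, 0.0], [0.3, 0.0]))
```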
The issue with these two approaches was that they greatly increase the number of possible training samples: from N images one can form on the order of N^2 pairs, but on the order of N^3 triplets and N^4 quadruplets, which in turn leads to a significant increase in training time. The siamese network approach, on the other hand, requires only two images per training sample, which is a significant improvement. Additionally, an extra difficulty with the triplet and quadruplet losses is that you need to sample meaningful triplets and quadruplets, which is a non-trivial task.
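A quick count illustrates how fast the number of candidate training samples grows. The sketch below simply counts unordered groups of 2, 3 and 4 images and ignores the labels, so it is an upper bound on the valid tuples rather than the exact number for any real dataset. The dataset size of 100 is an arbitrary example.

```python
from math import comb


def candidate_counts(num_images):
    """Number of unordered groups of 2, 3 and 4 images (labels ignored)."""
    return {
        "pairs": comb(num_images, 2),
        "triplets": comb(num_images, 3),
        "quadruplets": comb(num_images, 4),
    }


for name, count in candidate_counts(100).items():
    print(f"{name}: {count}")
```

Even for 100 images there are 4,950 candidate pairs but 161,700 candidate triplets and 3,921,225 candidate quadruplets, which is why sampling informative tuples matters so much for the triplet and quadruplet losses.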
Want to have a chat about this project? Feel free to contact us here.