Group or Delete: Hashing Product Image Duplicates

When we think of clutter, we usually picture packed drawers and cabinets at home, not data. But like household utensils, data takes up space and, if not maintained, gets just as messy.

Making sure our clients start 2023 with their “ducks (in our case data) in a row” has been a focus for our lead developer and architect, Andrei Vishneuski, especially for clients with extensive asset and image libraries.

Many of our retail clients have found that a single product can have hundreds of images that aren’t tagged under the same product or identifier because the image is cropped, has a different filter, is shot at a different angle, or has skewed colors. In reality, all of these images show the same product.

If not addressed, the image library fills up with duplicate images and assets and becomes disorganized, cluttered, and difficult to search. To remedy this, Crystalloids is investigating “image perceptual hashing”.

According to Andrei, when a leading luxury retailer complained about a growing number of duplicate images in their library that were being read as a “new product”, he realized that the issue could be solved with an algorithmic solution running on BigQuery.

“This retailer needs a way to organize their picture collection but currently doesn't have a way to detect similar pictures that are not a pixel by pixel comparison,” explains Andrei, noting that this is where a solution like image perceptual hashing can work.

What is Image Perceptual Hashing?

To explain where a solution like image perceptual hashing fits best, Andrei provided the following analogy:

“Suppose you have two photos of the same person but each is slightly different (but it’s the same person in the picture). Current traditional pixel by pixel comparison methods will not read or identify the images as similar,” Andrei says. This is where duplicated image assets and unnecessary variations begin to accumulate in your library.

“But when you implement a perceptual algorithm into your image dataset, you get in return a score for the images and how close the images are to each other - so although the conditions of the picture could be different - the actual subject of the picture is considered and ranked.”

In layman’s terms, “like our brain - we don’t process image comparison and similarity from pixel to pixel analysis but we recognize the closeness of the images by shape, color, etc. so image hashing allows for a way to detect picture similarity that is not a pixel by pixel comparison,” Andrei shares.
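The idea of a similarity score can be illustrated with a minimal sketch. The article does not say which perceptual hash Crystalloids uses, so the difference hash (dHash) below is an assumption chosen for its simplicity, and the tiny pixel grids are made-up stand-ins for downscaled grayscale thumbnails. Each bit records whether a pixel is brighter than its right-hand neighbor, and the score is the number of bits in which two hashes differ (the Hamming distance).

```python
from typing import List

Grid = List[List[int]]  # rows of grayscale pixel values, 0-255


def dhash(grid: Grid) -> int:
    """Difference hash: one bit per pair of horizontally adjacent pixels."""
    bits = 0
    for row in grid:
        for left, right in zip(row, row[1:]):
            bits = (bits << 1) | (1 if left > right else 0)
    return bits


def hamming(a: int, b: int) -> int:
    """Number of differing bits; 0 means the two hashes are identical."""
    return bin(a ^ b).count("1")


# Hypothetical 4x5 thumbnails (illustrative values only).
original = [
    [10, 40, 90, 60, 20],
    [15, 50, 95, 55, 25],
    [20, 60, 99, 50, 30],
    [25, 70, 90, 45, 35],
]
# Same picture with every pixel brightened: no pixel value matches the original.
brightened = [[p + 30 for p in row] for row in original]
# An unrelated picture: a plain left-to-right fade.
unrelated = [[200, 150, 100, 50, 0] for _ in range(4)]

print(original == brightened)                           # pixel-by-pixel comparison fails
print(hamming(dhash(original), dhash(brightened)))      # score for the brightened copy
print(hamming(dhash(original), dhash(unrelated)))       # score for the unrelated picture
```

The brightened copy differs from the original in every pixel, yet the relative brightness pattern is unchanged, so its hash is the same and the score is zero, while the unrelated picture scores much higher.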

According to Andrei, the perceptual image hashing algorithm will also recognize and relate an image in your library even if it has been cropped and saved with a different resolution or format (e.g., JPEG, TIFF, PNG). While it is possible to run on other data warehouses, in Andrei’s experience image hashing works best on BigQuery.
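One reason a perceptual hash can ignore resolution is that the image is first shrunk to a small fixed grid, so the hash never sees the original pixel dimensions. The sketch below shows this with an average hash (aHash), again an assumed variant rather than the one used in the project; `shrink`, `average_hash`, and `upscale` are illustrative helper names. File format is not modeled here because the hash works on decoded pixels, and how well cropping is tolerated depends on the hash variant, which this sketch does not demonstrate.

```python
from typing import List

Grid = List[List[float]]


def shrink(grid: Grid, size: int) -> Grid:
    """Downscale to size x size by averaging equal blocks.

    Assumes both image dimensions are multiples of `size`.
    """
    bh, bw = len(grid) // size, len(grid[0]) // size
    out = []
    for r in range(size):
        row = []
        for c in range(size):
            block = [grid[y][x]
                     for y in range(r * bh, (r + 1) * bh)
                     for x in range(c * bw, (c + 1) * bw)]
            row.append(sum(block) / len(block))
        out.append(row)
    return out


def average_hash(grid: Grid, size: int = 4) -> int:
    """One bit per cell of the shrunken image: 1 if brighter than the mean."""
    flat = [p for row in shrink(grid, size) for p in row]
    mean = sum(flat) / len(flat)
    bits = 0
    for p in flat:
        bits = (bits << 1) | (1 if p > mean else 0)
    return bits


def upscale(grid: Grid, factor: int) -> Grid:
    """Enlarge an image by repeating every pixel `factor` times in each direction."""
    return [[p for p in row for _ in range(factor)]
            for row in grid for _ in range(factor)]


# Hypothetical 4x4 image and the same image saved at twice the resolution.
small = [
    [10, 200, 30, 220],
    [15, 210, 25, 230],
    [240, 20, 250, 35],
    [235, 30, 245, 40],
]
large = upscale(small, 2)  # 8x8

print(average_hash(small) == average_hash(large))
```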

“Currently, we run this algorithm on BigQuery - we take a batch or collection of pictures and send it (the meta information) to the demo system (a PubSub); this is then stored and transferred to a BigQuery table and visualized via Looker Studio,” says Andrei, who is currently implementing this solution for a high-level luxury retail client that works on Google’s Cloud Platform.

“Without BigQuery it’s harder to implement an algorithm like this - BigQuery can handle so much more data than other data warehouse services and you can run more complicated algorithms such as the one needed for image perceptual hashing,” notes Andrei.

How is Image Hashing Implemented?

Although the implementation is intricate, Andrei outlined the key steps for running an image hashing algorithm and then visualizing the results.

At a high level, perceptual image hashing is executed with the following steps:

1: Collect the image information from the client via PubSub (the minimum detail needed is the URL, i.e. the HTTP address, of the picture)

2: Process the image information from PubSub, then download it to BigQuery

3: Calculate perceptual hashes by running a Cloud Function, then store the image information along with the calculated perceptual hashes in BigQuery

4: The calculated results are broken into the following image similarity categories, each with a numerical score:

- Transformed images
- Cropped images
- Re-colored images

Scores as close to zero as possible reflect near-perfect image similarity as detected by the algorithm on BigQuery

5: Visualize the final calculations and image similarity scores in Looker Studio
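Steps 1 to 3 can be sketched as a single handler, under several assumptions. A Pub/Sub-triggered function receives the message body base64-encoded in a `data` field; everything else here is hypothetical: the `image_url` field name, the `handle_message` and `fake_fetch` helpers, the dHash variant, and the shape of the returned row. Downloading the image and writing to BigQuery are deliberately left out, with the download replaced by a stub, so that the sketch runs without cloud credentials.

```python
import base64
import json
from typing import Callable, Dict, List

Grid = List[List[int]]  # rows of grayscale pixel values


def dhash(grid: Grid) -> int:
    """Difference hash: one bit per pair of horizontally adjacent pixels."""
    bits = 0
    for row in grid:
        for left, right in zip(row, row[1:]):
            bits = (bits << 1) | (1 if left > right else 0)
    return bits


def handle_message(event: Dict, fetch_pixels: Callable[[str], Grid]) -> Dict:
    """Turn one Pub/Sub-style event into a row ready to be stored."""
    # Step 1: the message carries at least the URL of the picture.
    payload = json.loads(base64.b64decode(event["data"]).decode("utf-8"))
    url = payload["image_url"]  # hypothetical field name
    # Step 2: fetch and decode the image (stubbed via the injected callable).
    pixels = fetch_pixels(url)
    # Step 3: compute the hash and return it alongside the image information.
    return {"image_url": url, "phash": dhash(pixels)}


def fake_fetch(url: str) -> Grid:
    """Stand-in for download + decode + shrink; returns a fixed thumbnail."""
    return [[10, 40, 90, 60, 20] for _ in range(4)]


message = {"image_url": "https://example.com/shoe.jpg"}
event = {"data": base64.b64encode(json.dumps(message).encode("utf-8"))}
row = handle_message(event, fake_fetch)
print(row)
```

Passing the download step in as a callable keeps the hashing logic testable on its own; in a real deployment that callable would fetch the URL and decode the image, and the returned row would be written to a BigQuery table.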

Do I Need This? Can I Implement This?

Image perceptual hashing can serve any data-driven company that handles large volumes of image data and has an extensive image library that must be kept updated, organized, and well-cataloged.

“Perceptual hashing allows you to detect similar images even if they are not completely the same, for instance cropped, re-colored, or transformed images. This can be important for keeping an image collection in order and for avoiding duplicated tracking and different representations of one product in your catalog,” explains Andrei.
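Once every image has a hash, keeping the collection in order amounts to grouping images whose hashes are close. The sketch below is one possible approach, not the project's implementation: it links any two images whose Hamming distance is within a threshold and merges linked images into groups with a small union-find. The file names, 16-bit hash values, and the threshold of 2 are all made up for illustration.

```python
from itertools import combinations
from typing import Dict, List


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two hashes."""
    return bin(a ^ b).count("1")


def group_similar(hashes: Dict[str, int], max_distance: int) -> List[List[str]]:
    """Group images whose hashes are within max_distance of each other."""
    parent = {name: name for name in hashes}

    def find(name: str) -> str:
        # Follow parent links to the group representative, compressing the path.
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    # Compare every pair; fine for a sketch, quadratic for large libraries.
    for a, b in combinations(hashes, 2):
        if hamming(hashes[a], hashes[b]) <= max_distance:
            parent[find(a)] = find(b)

    groups: Dict[str, List[str]] = {}
    for name in hashes:
        groups.setdefault(find(name), []).append(name)
    return sorted(sorted(members) for members in groups.values())


# Hypothetical hashes for three variants of one product photo and one other product.
library = {
    "shoe_front.jpg": 0b1111000011110000,
    "shoe_front_cropped.png": 0b1111000011110001,
    "shoe_front_sepia.tiff": 0b1111000011010000,
    "bag.jpg": 0b0000111100001111,
}
groups = group_similar(library, max_distance=2)
print(groups)
```

Each resulting group can then be reviewed as one product: keep a single representative and tag or delete the rest. The threshold is a tuning choice, since a looser value merges more variants but risks grouping different products.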

Given the volume of image data that must be processed to run image perceptual hashing, we advise using the data warehouse capabilities of BigQuery.

“The heart and uniqueness of the implementation is due to BigQuery. Perceptual hashing is possible because of the data storage and processing power of BigQuery. Plus, visualization is important. To show how this algorithm works we use Data Studio [now Looker Studio] which is great for showing how image perceptual hashing connects and identifies the same images,” notes Andrei.

Conclusion

Currently, our team is testing this solution for implementation with a retail client that has a substantial online shop offering. But such a function can benefit any of our past, current, and future clients.

In a nutshell, the solution can address the following use cases:

  • Organizing an image collection and keeping it in proper order by avoiding duplicates and multiple catalog variations of the same product
  • Theoretically, serving as a recommendation algorithm. A buyer who has ordered a product may be interested in buying similar-looking products, and the similarity can be computed from perceptual hashes.
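The recommendation idea can be sketched as a nearest-neighbor lookup: rank the other catalog items by how few hash bits differ from the purchased product. This is only an illustration of the theoretical use case above; the `recommend` helper, the SKU names, and the 8-bit hash values are hypothetical, and a hash of this kind captures visual resemblance only, not whether a product is actually relevant to the buyer.

```python
from typing import Dict, List


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two hashes."""
    return bin(a ^ b).count("1")


def recommend(purchased: str, catalog: Dict[str, int], top_n: int = 2) -> List[str]:
    """Return the top_n products whose hashes are closest to the purchased one."""
    target = catalog[purchased]
    scored = [(hamming(target, h), sku)
              for sku, h in catalog.items() if sku != purchased]
    # Sort by distance first, then by SKU so ties resolve deterministically.
    return [sku for _, sku in sorted(scored)[:top_n]]


# Hypothetical catalog: SKU -> perceptual hash of the main product image.
catalog = {
    "sneaker-white": 0b10110010,
    "sneaker-cream": 0b10110011,
    "sneaker-black": 0b10100110,
    "handbag-red": 0b01001101,
}
suggestions = recommend("sneaker-white", catalog)
print(suggestions)
```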

Our team will continue to explore what can be leveraged with image perceptual hashing, including how it can serve as a recommendation algorithm, in a follow-up article. For now, if you have any questions about how this solution can improve your internal image data library, or if you’d like to have your image library optimized for the new year, feel free to reach out.