Find the Assembly Mistakes: Error Segmentation for Industrial Applications

Dan Lehman, Tim J. Schoonbeek, Shao-Hsuan Hung, Jacek Kustra, Peter H. N. de With, Fons van der Sommen

*equal contribution (ordered by coin-flip)

1Eindhoven University of Technology, 2ASML Research

Published at: ECCV 2024 Workshop on Vision-based InduStrial InspectiON (VISION)


TL;DR: applying change-detection algorithms to assembly-error segmentation

Abstract

Recognizing errors in assembly and maintenance procedures is valuable for industrial applications, since it can increase worker efficiency and prevent unplanned downtime. Although assembly state recognition is gaining attention, none of the current works investigates assembly error localization. Therefore, we propose StateDiffNet, which localizes assembly errors by detecting the differences between an image of the (correct) intended assembly state and a test image captured from a similar viewpoint. StateDiffNet is trained on synthetically generated image pairs, providing full control over the type of meaningful change that should be detected.

The proposed approach is the first to correctly localize assembly errors in real ego-centric video data, for both states and error types that are never presented during training. Furthermore, deploying change detection in this industrial application provides valuable insights into the mechanisms of state-of-the-art change-detection algorithms. The code and data generation pipeline are publicly available.

Video Demonstration


Note: website is currently under construction! Status:

  • [Soon] Datasets used for training and evaluation
  • [12 Sept 2024] Video demo
  • [26 Aug 2024] ArXiv paper
  • [23 Aug 2024] GitHub repo with Unity code
  • [23 Aug 2024] GitHub repo with Python code

Motivation

Many works perform some type of assembly state detection or assembly quality inspection, typically framed as a supervised classification or detection problem. Although these approaches show decent performance at classifying a small number of assembly states, their performance on erroneous states is not investigated. A representation-learning approach, rather than a classification-based one, has been demonstrated to recognize unseen assembly errors, a fundamental requirement for a viable system. However, this approach lacks spatial localization of errors and therefore cannot provide interpretable feedback. Another line of work relies on anomaly detection, which requires a model to learn the typical structure of nominal data, based on an a-priori definition of what is nominal. This is not feasible for industrial procedures, since whether an assembly state contains an error depends on the intended state at that moment of the assembly.

To overcome these fundamental limitations of error localization in the industrial domain, we propose a methodology that pinpoints the differences between an object in two assembly states (including erroneous ones), using segmentation. Our approach is the first error-localization system that can locate errors on states that were never encountered during training, and on much more complex assembly configurations than related works. We train on more than 10^5 unique state combinations, including states with very small differences, and test on entirely unseen states.

Given two images of an assembly object, our model segments all meaningful state differences that can be inferred from its view of the object. A core component of the proposed approach is the methodology for generating and sampling synthetic image pairs, which provides full control over the meaningful change that the system should detect, as well as the changes that the system should be invariant to, i.e., the expected variability resulting from aspects such as camera pose, photometry, image distortions, and shadows.
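As a concrete illustration, the sketch below shows one way to enforce this split between meaningful change and nuisance variability when sampling training pairs. The transforms and parameters are our own placeholders, not the paper's exact pipeline:

import random
from torchvision import transforms
from torchvision.transforms import functional as F

# Photometric nuisances (lighting, color): safe to apply INDEPENDENTLY to
# each image of a pair, since they move no pixels and thus cannot
# desynchronize the images from the ground-truth change mask.
photometric = transforms.ColorJitter(brightness=0.3, contrast=0.3, saturation=0.2)

def make_training_pair(anchor, sample, change_mask):
    # Independent photometry: the model must learn to ignore it.
    anchor, sample = photometric(anchor), photometric(sample)
    # Shared geometry: applied identically to both images AND the mask,
    # so apparent change still corresponds to genuine state differences.
    if random.random() < 0.5:
        anchor, sample, change_mask = (
            F.hflip(anchor), F.hflip(sample), F.hflip(change_mask))
    return anchor, sample, change_mask

The key design choice here is that photometric nuisances vary per image while geometric transforms are shared, so lighting differences never correlate with the labeled change.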


Overview of the methodology

The architecture builds on [1] and is modified to perform segmentation rather than object detection.


Proposed change-segmentation architecture, modified from [1], consisting of a Siamese encoder, cross-attention-based feature-fusion blocks as skip connections in the U-Net-style decoder, and a two-layer convolutional segmentation head.

[1] Sachdeva, Ragav, and Andrew Zisserman. "The change you want to see." WACV 2023.
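
To make the figure concrete, below is a minimal PyTorch sketch of such an architecture. This is our own illustrative reconstruction, not the released implementation: channel widths, encoder depth, and attention details are placeholder assumptions.

import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    """Fuses same-scale features of the two images via cross-attention:
    every location in the test-image features queries the anchor features."""
    def __init__(self, dim, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, f_test, f_anchor):
        b, c, h, w = f_test.shape
        q = f_test.flatten(2).transpose(1, 2)       # (B, HW, C)
        kv = f_anchor.flatten(2).transpose(1, 2)    # (B, HW, C)
        out, _ = self.attn(q, kv, kv)
        return out.transpose(1, 2).reshape(b, c, h, w)

class StateDiffNetSketch(nn.Module):
    def __init__(self, dims=(32, 64, 128)):
        super().__init__()
        # Siamese encoder: the SAME weights embed both images.
        self.enc = nn.ModuleList(
            nn.Sequential(nn.Conv2d(cin, cout, 3, stride=2, padding=1),
                          nn.BatchNorm2d(cout), nn.ReLU())
            for cin, cout in zip((3,) + dims[:-1], dims))
        # One fusion block per scale; fused features act as skip connections.
        self.fuse = nn.ModuleList(CrossAttentionFusion(d) for d in dims)
        # U-Net-style decoder that upsamples back to input resolution.
        self.dec = nn.ModuleList(
            nn.Sequential(nn.ConvTranspose2d(cin, cout, 2, stride=2),
                          nn.BatchNorm2d(cout), nn.ReLU())
            for cin, cout in zip(dims[::-1], dims[-2::-1] + (dims[0],)))
        # Two-layer convolutional segmentation head -> binary change logits.
        self.head = nn.Sequential(nn.Conv2d(dims[0], dims[0], 3, padding=1),
                                  nn.ReLU(),
                                  nn.Conv2d(dims[0], 1, 1))

    def forward(self, anchor, test):
        skips, fa, ft = [], anchor, test
        for enc, fuse in zip(self.enc, self.fuse):
            fa, ft = enc(fa), enc(ft)
            skips.append(fuse(ft, fa))
        x = skips[-1]
        for dec, skip in zip(self.dec, skips[-2::-1] + [None]):
            x = dec(x)
            if skip is not None:
                x = x + skip
        return self.head(x)

Calling StateDiffNetSketch()(anchor, test) on two (B, 3, H, W) batches, with H and W divisible by 8, yields per-pixel change logits at the input resolution.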


Image pair selection

Our approach is trained entirely on synthetic data, generated with a modified version of the Unity Perception package. The code for this is publicly available (see the GitHub repositories linked above). The image pairs are selected as follows:


Overview of the training process. The ground-truth binary change mask of an image pair is created by taking the difference between the instance segmentation masks of the anchor and of the sample image, viewed from the camera pose of the anchor.
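
A minimal sketch of this mask construction, assuming the renderer exports per-pixel part-ID maps for both states from the anchor's camera pose (the actual Unity pipeline differs in its implementation details):

import numpy as np

def gt_change_mask(anchor_ids: np.ndarray, sample_ids: np.ndarray) -> np.ndarray:
    """anchor_ids / sample_ids: (H, W) integer part-ID maps rendered from
    the anchor's camera pose, with 0 denoting background. Returns a binary
    (H, W) mask marking every pixel whose visible part differs between
    the two states (added, removed, or swapped parts)."""
    return (anchor_ids != sample_ids).astype(np.uint8)

# Example: part 2 is present in the anchor but missing in the sample.
anchor = np.array([[0, 1, 1],
                   [0, 2, 2]])
sample = np.array([[0, 1, 1],
                   [0, 0, 0]])
print(gt_change_mask(anchor, sample))
# [[0 0 0]
#  [0 1 1]]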


Results on real-world data

While the models are trained exclusively on synthetic data, some configurations are able to perform error segmentation on real-world data:


Real-world qualitative results on different assembly errors. The global cross-attention-based model performs best, particularly on missing parts and components and on misorientations. No model is able to reliably detect placement errors (last row).


BibTeX

@article{lehman2024find,
  title={Find the Assembly Mistakes: Error Segmentation for Industrial Applications},
  author={Lehman, Dan and Schoonbeek, Tim J and Hung, Shao-Hsuan and Kustra, Jacek and de With, Peter HN and van der Sommen, Fons},
  journal={arXiv preprint arXiv:2408.12945},
  year={2024}
}