To keep up with the growing interest of using Web shops for product comparison, we have developed a method that targets the problem of product duplicate detection. If duplicates can be discovered correctly and quickly, customers can compare products in an efficient manner. We build upon the state-of-the-art Multi-component Similarity Method (MSM) for product duplicate detection by developing an automated pre-processing phase that occurs before the similarities between products are calculated. Specifically, in this prior phase the features of products are aligned between Web shops, using metrics such as the data type, coverage, and diversity of each key, as well as the distribution and used measurement units of their corresponding values. With this information, the values of these keys can be more meaningfully and efficiently employed in the process of comparing products. Applying our method to a real-world dataset of 1629 TV's across 4 Web shops, we find that we increase the speed of the product similarity phase by roughly a factor 3 due to fewer meaningless comparisons, an improved brand analyzer, and a renewed title analyzer. Moreover, in terms of quality of duplicate detection, we significantly outperform MSM with an F 1-measure of 0.746 versus 0.525.
Bibliographical notePublisher Copyright:
© 2021 The Author(s)