Multi-component similarity method for web product duplicate detection

Ronald Van Bezu, Sjoerd Borst, Rick Rijkse, Jim Verhagen, Damir Vandic, Flavius Frasincar

Research output: Chapter/Conference proceeding, Conference proceeding, Academic, peer-review

18 Citations (Scopus)
163 Downloads (Pure)

Abstract

Due to the growing number of Web shops, aggregating product data from the Web is growing in importance. One of the problems encountered in product aggregation is duplicate detection. In this paper, we extend and significantly improve an existing state-of-the-art product duplicate detection method. Our approach employs a novel method for combining the titles' and the attributes' similarities into a final product similarity. We use q-grams to handle partial matching of words, such as abbreviations. Where existing methods cluster products of only two Web shops, we propose a hierarchical clustering method to handle multiple Web shops. Applying our new method to a dataset of TVs from four Web shops reveals that it significantly outperforms the Hybrid Similarity Method, the Title Model Words Method, and the well-known TF-IDF method, with an F1 score of 0.475 compared to 0.287, 0.298, and 0.335, respectively.
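
As an illustration of the q-gram idea mentioned in the abstract, the following is a minimal sketch of how character q-grams allow partial matches between words. The abstract does not state the value of q or the similarity measure used in the paper, so the choice of q = 3, the Jaccard overlap, and the function names `qgrams` and `qgram_similarity` are all assumptions made for this example, not the authors' method.

```python
def qgrams(text: str, q: int = 3) -> set[str]:
    """Return the set of overlapping character q-grams of a lowercased string.

    Strings shorter than q are kept whole (an assumption for this sketch).
    """
    s = text.lower()
    if len(s) < q:
        return {s} if s else set()
    return {s[i:i + q] for i in range(len(s) - q + 1)}


def qgram_similarity(a: str, b: str, q: int = 3) -> float:
    """Jaccard overlap of the two q-gram sets, a value in [0, 1].

    Jaccard is an illustrative choice; other set-overlap measures
    could be substituted here.
    """
    ga, gb = qgrams(a, q), qgrams(b, q)
    if not ga and not gb:
        return 1.0  # two empty strings are treated as identical
    return len(ga & gb) / len(ga | gb)


if __name__ == "__main__":
    # A shortened spelling still shares some q-grams with the full word.
    print(qgram_similarity("samsung", "samsng"))
    print(qgram_similarity("samsung", "philips"))
```

Because the comparison works on overlapping character fragments rather than whole tokens, a shortened or misspelled word can still receive a non-zero score against its full form, which exact token matching would not give.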

Original language: English
Title of host publication: 2015 Symposium on Applied Computing, SAC 2015
Editors: Dongwan Shin
Publisher: Association for Computing Machinery
Pages: 761-768
Number of pages: 8
ISBN (Electronic): 9781450331968
Publication status: Published - 13 Apr 2015
Event: 30th Annual ACM Symposium on Applied Computing, SAC 2015 - Salamanca, Spain
Duration: 13 Apr 2015 to 17 Apr 2015

Publication series

Series: Proceedings of the ACM Symposium on Applied Computing
Volume: 13-17-April-2015

Conference

Conference: 30th Annual ACM Symposium on Applied Computing, SAC 2015
Country/Territory: Spain
City: Salamanca
Period: 13/04/15 to 17/04/15

Bibliographical note

Publisher Copyright:
Copyright 2015 ACM.
