Abstract
Due to the growing number of Web shops, aggregating product data from the Web is becoming increasingly important. One of the problems encountered in product aggregation is duplicate detection. In this paper, we extend and significantly improve an existing state-of-the-art product duplicate detection method. Our approach employs a novel method for combining the titles' and the attributes' similarities into a final product similarity. We use q-grams to handle partial matching of words, such as abbreviations. Where existing methods cluster products from only two Web shops, we propose a hierarchical clustering method to handle multiple Web shops. Applying our new method to a dataset of TVs from four Web shops reveals that it significantly outperforms the Hybrid Similarity Method, the Title Model Words Method, and the well-known TF-IDF method, with an F1 score of 0.475 compared to 0.287, 0.298, and 0.335, respectively.
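
To illustrate how q-grams allow partial matching of words, here is a minimal Python sketch. The abstract does not specify the exact q-gram measure used in the paper, so the Dice-style overlap, the default `q = 3`, and the function names `qgrams` and `qgram_similarity` are assumptions made purely for illustration.

```python
from collections import Counter


def qgrams(text: str, q: int = 3) -> Counter:
    """Return the multiset of overlapping q-grams of a lowercased string."""
    s = text.lower()
    if len(s) < q:
        # Strings shorter than q are kept as one token (an assumption).
        return Counter([s]) if s else Counter()
    return Counter(s[i:i + q] for i in range(len(s) - q + 1))


def qgram_similarity(a: str, b: str, q: int = 3) -> float:
    """Dice-style overlap (assumed measure): 2 * shared / total, in [0, 1]."""
    ga, gb = qgrams(a, q), qgrams(b, q)
    total = sum(ga.values()) + sum(gb.values())
    if total == 0:
        return 0.0
    # Counter intersection keeps the minimum count of each shared q-gram.
    shared = sum((ga & gb).values())
    return 2.0 * shared / total
```

With this measure, a word and its truncated form, such as "inches" and "inch", share most of their q-grams and receive a high score, whereas an exact string comparison would treat them as entirely different.
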
Original language | English |
---|---|
Title of host publication | 2015 Symposium on Applied Computing, SAC 2015 |
Editors | Dongwan Shin |
Publisher | Association for Computing Machinery |
Pages | 761-768 |
Number of pages | 8 |
ISBN (Electronic) | 9781450331968 |
Publication status | Published - 13 Apr 2015 |
Event | 30th Annual ACM Symposium on Applied Computing, SAC 2015 - Salamanca, Spain. Duration: 13 Apr 2015 → 17 Apr 2015 |
Publication series
Series | Proceedings of the ACM Symposium on Applied Computing |
---|---|
Volume | 13-17-April-2015 |
Conference
Conference | 30th Annual ACM Symposium on Applied Computing, SAC 2015 |
---|---|
Country/Territory | Spain |
City | Salamanca |
Period | 13/04/15 → 17/04/15 |
Bibliographical note
Publisher Copyright: Copyright 2015 ACM.