Journal article in AI Open, 2023

Word sense induction with agglomerative clustering and mutual information maximization

Abstract

Word sense induction (WSI) is a challenging problem in natural language processing that involves the unsupervised, automatic detection of a word's senses (i.e., meanings). Recent work achieves significant results on the WSI task by pre-training a language model dedicated exclusively to word sense disambiguation, while other approaches employ off-the-shelf pre-trained language models together with additional strategies to induce senses. This paper proposes a novel unsupervised method based on hierarchical clustering and invariant information clustering (IIC). The IIC loss is used to train a small model to maximize the mutual information between two vector representations of a target word occurring in a pair of synthetic paraphrases. This trained model is then used in inference mode to extract higher-quality vector representations for the hierarchical clustering. We evaluate our method on two WSI tasks and in two distinct clustering configurations (fixed and dynamic number of clusters), and empirically show that our approach is at least on par with state-of-the-art baselines, outperforming them in several configurations. The code and data to reproduce this work are publicly available.
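For illustration only, the sketch below shows the two components the abstract describes: an IIC objective that maximizes the mutual information between the soft sense assignments of two paraphrase views of the same target word, and agglomerative clustering of the extracted representations. This is not the authors' released implementation; the function names (`iic_loss`, `induce_senses`), the tensor shapes, and all hyperparameters (number of sense clusters, cosine distance threshold) are assumptions made here for the sketch.

```python
import torch
from sklearn.cluster import AgglomerativeClustering

def iic_loss(z: torch.Tensor, z_prime: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Negative mutual information between soft cluster assignments of two
    paraphrase views of a target word (invariant information clustering).
    z, z_prime: (batch, C) softmax outputs of the small model; C is the
    assumed number of sense clusters."""
    # Empirical joint distribution over cluster pairs, symmetrized and normalized.
    p = (z.unsqueeze(2) * z_prime.unsqueeze(1)).sum(dim=0)  # (C, C)
    p = (p + p.t()) / 2.0
    p = (p / p.sum()).clamp(min=eps)
    pi = p.sum(dim=1, keepdim=True)  # marginal over the first view
    pj = p.sum(dim=0, keepdim=True)  # marginal over the second view
    # I(z; z') = sum_{c,c'} p log(p / (pi * pj)); training minimizes -I.
    return -(p * (p.log() - pi.log() - pj.log())).sum()

def induce_senses(vectors, n_clusters=None, distance_threshold=0.5):
    """Cluster the (n_occurrences, d) target-word vectors extracted in
    inference mode. Passing n_clusters gives the fixed-K configuration;
    leaving it None lets the distance threshold set K dynamically.
    (metric= requires scikit-learn >= 1.2; older versions use affinity=.)"""
    clusterer = AgglomerativeClustering(
        n_clusters=n_clusters,
        distance_threshold=(None if n_clusters is not None else distance_threshold),
        metric="cosine",
        linkage="average",
    )
    return clusterer.fit_predict(vectors)
```

The `n_clusters` / `distance_threshold` switch mirrors the two clustering configurations (fixed and dynamic number of clusters) evaluated in the paper.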
Main file: WSI_ACL2023_submission.pdf (393.44 KB)
Origin: Files produced by the author(s)

Dates and versions

hal-04445537, version 1 (12-02-2024)

Identifiers

HAL Id: hal-04445537
DOI: 10.1016/j.aiopen.2023.12.001

Cite

Hadi Abdine, Moussa Kamal Eddine, Davide Buscaldi, Michalis Vazirgiannis. Word sense induction with agglomerative clustering and mutual information maximization. AI Open, 2023, 4, pp.193-201. ⟨10.1016/j.aiopen.2023.12.001⟩. ⟨hal-04445537⟩