TreeCloud & Unitex: an increased synergy

Claude Martineau

Abstract

TreeCloud is a tree cloud visualization of a text

TreeCloud builds a tree cloud visualization of a text, which looks like a tag cloud where the tags are displayed around a tree to reflect the co-occurrence distance between the words in the text.

Given two words A and B and:
• $O_{11}$, observed number of sliding windows containing both A and B
• $O_{12}$, observed number of sliding windows containing A but not B
• $O_{21}$, observed number of sliding windows containing B but not A
• $O_{22}$, observed number of sliding windows containing neither A nor B
the following variables are defined:
• $R_1 = O_{11} + O_{12}$, number of sliding windows containing A
• $R_2 = O_{21} + O_{22}$, number of sliding windows not containing A
• $C_1 = O_{11} + O_{21}$, number of sliding windows containing B
• $C_2 = O_{12} + O_{22}$, number of sliding windows not containing B
• $N = R_1 + R_2 = C_1 + C_2$, number of sliding windows
• $E_{11} = R_1 C_1 / N$, expected number of sliding windows containing both A and B
• $E_{12} = R_1 C_2 / N$, expected number of sliding windows containing A but not B
• $E_{21} = R_2 C_1 / N$, expected number of sliding windows containing B but not A
• $E_{22} = R_2 C_2 / N$, expected number of sliding windows containing neither A nor B

The co-occurrence formulas are defined as follows:
• jaccard: $1 - O_{11} / (O_{11} + O_{12} + O_{21})$
• liddell: $1 - (O_{11} O_{22} - O_{12} O_{21}) / (C_1 C_2)$
• dice: $1 - 2 O_{11} / (R_1 + C_1)$
• hyperlex: $1 - \max(O_{11} / R_1, O_{11} / C_1)$
• poissonstirling: $O_{11} (\log O_{11} - \log E_{11} - 1)$
• chisquared: $1000 - N (O_{11} - E_{11})^2 / (E_{11} E_{22})$
• zscore: $1 - (O_{11} - E_{11}) / \sqrt{E_{11}}$
• ms: $1 - \min(O_{11} / R_1, O_{11} / C_1)$
• oddsratio: $1 - \log((O_{11} O_{22}) / (O_{12} O_{21}))$
• loglikelihood: $1 - 2 (O_{11} \log(O_{11} / E_{11}) + O_{12} \log(O_{12} / E_{12}) + O_{21} \log(O_{21} / E_{21}) + O_{22} \log(O_{22} / E_{22}))$
• gmean: $1 - O_{11} / \sqrt{R_1 C_1} = 1 - O_{11} / \sqrt{N E_{11}}$
• mi (mutual information): $1 - \log(O_{11} / E_{11})$
• ngd (normalized Google distance): $(\max(\log R_1, \log C_1) - \log O_{11}) / (N - \min(\log R_1, \log C_1))$

Unitex/GramLab is a corpus analyser and annotation tool

• Based on automata and RTNs with outputs
• Multilingual: up to 22 languages (French, English, ..., Greek, ..., Korean, Thai)
• Unicode 3.0 (UTF8, UTF16LE, UTF16BE)
• Cross-platform: Linux, macOS, Windows
• Open source: https://github.com/UnitexGramLab
• Website and binary installers: http://unitexgramlab.org
• Under development since 2001 by a group of passionate volunteers

Unitex/GramLab uses linguistic resources:

• DELA (LADL electronic dictionaries)

A typical DELA entry is composed of a simple or compound inflected form, followed by a lemma and grammatical information. Each entry can be associated with syntactic and semantic attributes and inflection rules:

inflected_form,lemma.grammatical_information+attributes:inflection_rule

Example: given the French simple word "avocat" (lawyer) and the compound word "avocat d'affaires" (business lawyer), a DELA representation would be:

avocat,avocat.N+Hum+Prof:ms
avocate,avocat.N+Hum+Prof:fs
avocats,avocat.N+Hum+Prof:mp
avocates,avocat.N+Hum+Prof:fp
avocat d'affaires,avocat d'affaires.N+Hum+Prof:ms
avocate d'affaires,avocat d'affaires.N+Hum+Prof:fs
avocats d'affaires,avocat d'affaires.N+Hum+Prof:mp
avocates d'affaires,avocat d'affaires.N+Hum+Prof:fp

The parts of an entry are:
1 inflected form
2 ,canonical form
3 .grammatical category
4 +semantic attributes
5 :inflectional information (m: masculine, f: feminine, s: singular, p: plural)

• Syntactic or semantic rules called «local grammars», represented by graphs
• The graphical representation of a local grammar is composed of a set of linked boxes.
• A successful path is a path between the initial and final states.

The example grammar contains two search paths:
• an adverb ending in -ly followed by a past participle
• a noun followed by a verb in progressive form

A lexical mask refers to the text dictionary. The recognized sequences are surrounded by a tag. The results are presented in the form of concordances.

Some examples of sequences matched and not matched by a grammar:

Unitex-GramLab is a corpus processing suite [MATCH]
Unitex-GramLab is an open source corpus processing suite [MATCH]
Unitex-GramLab is a hard to learn corpus processing suite [FAIL]
Unitex-GramLab is [FAIL]

Applying dictionaries to the text produces the text dictionary. A local grammar is then applied, and its lexical masks refer to the text dictionary.

Several ways to use Unitex/GramLab

• Two interfaces written in Java: Unitex IDE (classic) and GramLab IDE (project-oriented)
• Unitex Core, written in C/C++
• Command lines or system calls with Perl, Python, etc.
• The C and Java (JNI) API, which provides access to a virtual file system and a persistence layer for resources (alphabets, dictionaries and corpora)

How and Why to plug Unitex into TreeCloud?

Take advantage of the work already done by Unitex.

Unitex/GramLab analysis steps

Program called: created files
• Normalize: text.snt
• Tokenize: tokens.txt, text.cod
• Dico: dlf, dlc, err
• Locate: concord.ind
• Concord: concordances, annotated text

At the end of the Unitex analysis process:
• text.snt contains a cleaned text (normalization of separator characters);
• text.cod contains, for each token of the text, its index in the token list stored in tokens.txt;
• dlf, dlc and err contain, respectively, the simple words, the compound words and the unknown words;
• concord.ind contains the matched sequences with their positions in the text (XXX, and multiword units).

To get the «new text», we retokenize the text, using the matched sequences of the concord.ind file as the new tokens of the text. New tokens.txt and text.cod files are created. This process avoids reading the text twice and splitting it into words twice. Thanks to the Unitex API and its virtual file system, all this work is done in memory.
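The retokenization step described above can be illustrated with a minimal Python sketch. It is not the Unitex API and does not read the real tokens.txt, text.cod or concord.ind file formats: the token list, the `(start, end)` match tuples and the function names `retokenize` and `encode` are all hypothetical, chosen only to show how matched spans become single tokens and how a new token list and index sequence follow from that.

```python
def retokenize(tokens, matches):
    """Merge each matched token span into a single new token.

    tokens  : list of token strings, in text order
    matches : list of (start, end) token indexes, end inclusive;
              assumed sorted and non-overlapping (hypothetical format,
              not the actual concord.ind layout)
    """
    merged = []
    pos = 0
    for start, end in matches:
        merged.extend(tokens[pos:start])                # tokens before the match
        merged.append("".join(tokens[start:end + 1]))   # fuse the matched span
        pos = end + 1
    merged.extend(tokens[pos:])                         # remaining tokens
    return merged


def encode(tokens):
    """Return (vocabulary, codes): the distinct tokens in order of first
    appearance, and for each token of the text its index in that list."""
    vocabulary, index, codes = [], {}, []
    for tok in tokens:
        if tok not in index:
            index[tok] = len(vocabulary)
            vocabulary.append(tok)
        codes.append(index[tok])
    return vocabulary, codes


# Illustrative input: spaces and the apostrophe are separate tokens.
tokens = ["un", " ", "avocat", " ", "d", "'", "affaires", " ", "plaide"]
new_tokens = retokenize(tokens, [(2, 6)])   # span covering "avocat d'affaires"
vocabulary, codes = encode(new_tokens)
```

In this sketch the compound "avocat d'affaires" ends up as one entry of `vocabulary`, so a downstream frequency or co-occurrence count treats it as a single word.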
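Returning to the co-occurrence definitions given for TreeCloud, the contingency counts and a few of the listed distances (jaccard, dice, hyperlex, ms, gmean, mi) can be sketched as follows. This is an illustration under stated assumptions, not TreeCloud's own code: windows are taken with a step of one word, the logarithm is the natural logarithm, both words are assumed to co-occur at least once, and the function names are hypothetical.

```python
import math


def sliding_windows(words, width):
    """Return the set of words seen in each window of `width` consecutive
    words (step of one word: an assumption of this sketch)."""
    return [set(words[i:i + width]) for i in range(len(words) - width + 1)]


def contingency(windows, a, b):
    """Count windows by presence of a and b; returns (o11, o12, o21, o22)."""
    o11 = o12 = o21 = o22 = 0
    for window in windows:
        has_a, has_b = a in window, b in window
        if has_a and has_b:
            o11 += 1
        elif has_a:
            o12 += 1
        elif has_b:
            o21 += 1
        else:
            o22 += 1
    return o11, o12, o21, o22


def distances(o11, o12, o21, o22):
    """A few of the listed distances; assumes o11 > 0 so that every
    ratio and logarithm is defined."""
    r1, c1 = o11 + o12, o11 + o21
    n = o11 + o12 + o21 + o22
    e11 = r1 * c1 / n
    return {
        "jaccard": 1 - o11 / (o11 + o12 + o21),
        "dice": 1 - 2 * o11 / (r1 + c1),
        "hyperlex": 1 - max(o11 / r1, o11 / c1),
        "ms": 1 - min(o11 / r1, o11 / c1),
        "gmean": 1 - o11 / math.sqrt(r1 * c1),
        "mi": 1 - math.log(o11 / e11),   # natural log: an assumption
    }


words = ["lawyer", "court", "judge", "lawyer", "court", "trial"]
counts = contingency(sliding_windows(words, 3), "lawyer", "judge")
result = distances(*counts)
```

With these six words and a window width of 3 there are four windows; "lawyer" appears in all of them and "judge" in three, so the two words come out as close under every distance computed here.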

The TreeCloud software, initially developed by Philippe Gambette and Jean Véronis, gives an overview of a text in the form of a word cloud. In "classic" word clouds, only word frequency is used to convey the importance of words in the text, by varying their font size and/or color. TreeCloud adds further information by arranging the most frequent words in a tree. The proximity of words in the tree cloud reflects their proximity in the text. For the most frequent words kept in the tree cloud to be informative, grammatical words (determiners, prepositions, pronouns, auxiliary verbs, etc.) must be filtered out using a stop list. Using Unitex makes it possible to represent the stop list more efficiently and to extend its coverage. Unitex preprocesses the text by applying dictionaries and a local grammar that recognizes, in the (source) text to be analyzed, the unwanted grammatical categories or forms. In addition, it is now easier to set up processing that makes certain information visible in the resulting cloud, such as the grammatical category of a word, or multiword units such as compound nouns or named entities, for example person names.
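The idea of filtering grammatical words by category rather than by an explicit word list can be sketched in Python. This is only an illustration, not how Unitex applies dictionaries or grammars: the parser handles the basic `inflected_form,lemma.category+attributes:inflection` layout without escaped characters, the set of unwanted category codes is an assumption, and the three entries are illustrative rather than copied from a real dictionary.

```python
def parse_dela(line):
    """Split 'inflected,lemma.CATEGORY+attr:infl' into its fields.

    Simplified: ignores backslash-escaped commas and dots, and treats an
    empty lemma as identical to the inflected form.
    """
    inflected, rest = line.split(",", 1)
    lemma, info = rest.split(".", 1)
    gram, *inflections = info.split(":")
    category, *attributes = gram.split("+")
    return {
        "inflected": inflected,
        "lemma": lemma or inflected,
        "category": category,
        "attributes": attributes,
        "inflections": inflections,
    }


# Hypothetical set of unwanted categories (determiner, preposition, pronoun).
STOP_CATEGORIES = {"DET", "PREP", "PRO"}

# Illustrative entries, as a text dictionary might list them.
entries = [parse_dela(line) for line in (
    "le,le.DET:ms",
    "avocats,avocat.N+Hum+Prof:mp",
    "de,de.PREP",
)]

# Keep only the forms whose category is not in the stop set.
kept = [e["inflected"] for e in entries if e["category"] not in STOP_CATEGORIES]
```

Filtering on the category field means a single code covers every determiner or preposition in the dictionary, which is one way to read the claim that the stop list becomes both more compact and broader in coverage.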
