Conceptometry: A Quantitative Framework for Measuring Conceptual Richness and Complexity in Texts
Preprint, working paper hal-05205705v1
2025-08-10
https://explore.openaire.eu/search/publication?pid=10.5281%2Fzenodo.16789573
Luigi Usai. Conceptometry: A Quantitative Framework for Measuring Conceptual Richness and Complexity in Texts. 2025. ⟨hal-05205705⟩
[1] L. Usai, "L'invenzione della Concettometria". Zenodo, Aug. 10, 2025. doi: 10.5281/zenodo.16789573.
How conceptometric analysis software would work
A good Conceptometry analyzer follows a clear pipeline, from text cleaning through to the final metrics and visualizations.
Pipeline architecture
- Ingestion and normalization
  - Input: text (IT/EN), with language selection.
  - Basic cleaning: removal of redundant whitespace, Unicode normalization, splitting into paragraphs.
- Linguistic analysis (NLP)
  - Tokenization, PoS tagging, lemmatization, dependency parsing.
  - Concept-candidate extraction: multi-word noun chunks, lemmatized NOUN/PROPN tokens.
- Concept canonicalization
  - Lowercasing, removal of internal stopwords ("di", "the"), deduplication between single lemmas and chunks.
- Concept complexity scoring
  - Semantic depth factor $F_d(c)$:
    - EN: maximum hypernym depth in WordNet, normalized.
    - IT: a rarity proxy (word frequency) as a surrogate for depth.
  - Abstraction factor $F_a(c)$:
    - EN: proximity to the "abstraction" branch in WordNet plus abstract suffixes (-ness, -ity, -ism, -tion, -ment).
    - IT: abstract suffixes (-ità, -zione, -tudine, -enza, -mento, -ismo, -logia, -ica) and a penalty for proper names.
- Metric computation
  - Raw Conceptual Density: $\mathrm{DCg} = \frac{|C|}{N}$.
  - Concept weight: $w(c) = \alpha \cdot F_d(c) + \beta \cdot F_a(c)$, normalized to $[0, 1]$.
  - Weighted Conceptual Density: $\mathrm{DCp} = \frac{\sum_{c \in C} w(c)}{N}$.
  - Conceptual Redundancy Index: $\mathrm{IRC} = 1 - \frac{|C|}{\sum_{c} f(c)}$.
  - Informational Efficiency: $\mathrm{EI} = \mathrm{DCp} \cdot (1 - \mathrm{IRC})$.
- Visualization and reporting
  - Table of the top concepts by weight/frequency.
  - Bar chart of the "heaviest" concepts.
  - CSV/JSON export.
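The metric definitions above can be checked on a toy example. A minimal sketch in plain Python, using hand-picked concept counts and weights (all numbers here are illustrative, not taken from any real analysis):

```python
from collections import Counter

# Illustrative toy data: a 20-token passage with three unique concepts
N = 20
concept_counts = Counter({"library": 3, "hexagon": 2, "infinity": 1})
weights = {"library": 0.35, "hexagon": 0.50, "infinity": 0.80}  # assumed w(c) values

C = len(concept_counts)                  # |C| = 3 unique concepts
mentions = sum(concept_counts.values())  # sum_c f(c) = 6 mentions

DCg = C / N                                        # raw conceptual density
DCp = sum(weights[c] for c in concept_counts) / N  # weighted conceptual density
IRC = 1 - C / mentions                             # redundancy index
EI = DCp * (1 - IRC)                               # informational efficiency

print(DCg, DCp, IRC, EI)  # 0.15 0.0825 0.5 0.04125
```

Note how repetition drives the numbers: the 6 mentions of only 3 concepts give IRC = 0.5, which halves DCp when computing EI.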
Key design choices
- Extracting concepts at the level of noun constituents is more robust than relying on isolated single lemmas.
- Using WordNet (EN) and morphological/frequency heuristics (IT) makes the approach practical with good coverage.
- Separating the NLP pipeline from the metric computation eases future extensions (other weights, languages, ontologies).
Reference implementation (Python + Tkinter GUI)
Below is an "all-in-one" prototype with:
- NLP: spaCy (it_core_news_sm / en_core_web_sm)
- Depth/abstraction: NLTK WordNet (EN), heuristics (IT), wordfreq
- GUI: Tkinter (text area, language combo box, results table, matplotlib chart)
Prerequisites (one-time setup):
- pip install spacy nltk wordfreq matplotlib
- python -m spacy download it_core_news_sm
- python -m spacy download en_core_web_sm
- In Python: import nltk; nltk.download('wordnet'); nltk.download('omw-1.4')
import re
import tkinter as tk
from tkinter import ttk, scrolledtext, messagebox, filedialog
from collections import Counter

import spacy
from spacy.lang.it.stop_words import STOP_WORDS as IT_STOP
from spacy.lang.en.stop_words import STOP_WORDS as EN_STOP

import matplotlib
matplotlib.use("TkAgg")
import matplotlib.pyplot as plt
from matplotlib.backends.backend_tkagg import FigureCanvasTkAgg

from wordfreq import zipf_frequency

# NLTK WordNet (EN only)
from nltk.corpus import wordnet as wn

# -----------------------
# Language resources
# -----------------------
NLP_MODELS = {}
STOPWORDS = {
    "it": IT_STOP,
    "en": EN_STOP
}

ABSTRACT_SUFFIXES_IT = ("ità", "zione", "tudine", "enza", "mento", "ismo", "logia", "ica")
ABSTRACT_SUFFIXES_EN = ("ness", "ity", "ism", "tion", "ment", "ship", "hood", "acy")

def get_nlp(lang):
    # Lazy-load and cache the spaCy model for the requested language
    if lang not in NLP_MODELS:
        if lang == "it":
            NLP_MODELS[lang] = spacy.load("it_core_news_sm")
        elif lang == "en":
            NLP_MODELS[lang] = spacy.load("en_core_web_sm")
        else:
            raise ValueError("Lingua non supportata.")
    return NLP_MODELS[lang]

# -----------------------
# Concept extraction
# -----------------------
def normalize_space(text):
    return re.sub(r"\s+", " ", text.strip())

def strip_stopwords_inside(tokens, lang):
    stop = STOPWORDS.get(lang, set())
    return [t for t in tokens if t not in stop]

def noun_chunk_to_lemma(chunk):
    # Join lemmas of tokens that are part of the chunk and are not punct/space
    lemmas = []
    for t in chunk:
        if t.is_space or t.is_punct:
            continue
        lemmas.append(t.lemma_.lower())
    return " ".join(lemmas)

def extract_concepts(doc, lang):
    # Candidate concepts: noun chunks + NOUN/PROPN lemmas
    candidates = []
    # Multi-word noun chunks
    for ch in doc.noun_chunks:
        lemma_chunk = noun_chunk_to_lemma(ch)
        # Remove internal stopwords for multiword normalization
        tokens = [w for w in lemma_chunk.split() if w.isalpha()]
        tokens = strip_stopwords_inside(tokens, lang)
        if tokens:
            candidates.append(" ".join(tokens))
    # Single nouns and proper nouns
    for tok in doc:
        if tok.pos_ in ("NOUN", "PROPN") and not tok.is_stop and tok.is_alpha:
            candidates.append(tok.lemma_.lower())
    # Clean and dedupe
    cleaned = []
    for c in candidates:
        c = c.strip()
        c = re.sub(r"\s+", " ", c)
        if c:
            cleaned.append(c)
    # Prefer multi-word terms over overlapping single words:
    # keep both, and let frequency naturally reflect salience.
    counts = Counter(cleaned)
    return counts

# -----------------------
# Complexity factors
# -----------------------
def normalize(x, lo, hi):
    if hi <= lo:
        return 0.0
    v = (x - lo) / (hi - lo)
    return max(0.0, min(1.0, v))

def fd_semantic_depth(concept, lang):
    """
    Fd(c): semantic depth proxy.
    EN: use WordNet max hypernym depth of head lemma (approx: last token).
    IT: proxy via rarity (zipf frequency).
    """
    head = concept.split()[-1]  # heuristic: head as last token
    if lang == "en":
        synsets = wn.synsets(head, pos=wn.NOUN)
        if synsets:
            depths = []
            for s in synsets:
                try:
                    depths.append(s.max_depth())
                except Exception:
                    pass
            if depths:
                # WordNet noun depth typical range ~0..20
                return normalize(max(depths), 0, 20)
        # Fallback to rarity proxy
        z = zipf_frequency(head, "en")
        return normalize(7 - z, 0, 7)  # higher when rarer
    else:
        # IT: rarity proxy
        z = zipf_frequency(head, "it")
        return normalize(7 - z, 0, 7)

def fa_abstraction(concept, lang):
    """
    Fa(c): abstraction factor.
    EN: check affiliation to 'abstraction' branch or abstract suffixes; multiword length boost.
    IT: suffix heuristics for abstraction; penalize proper names (titlecase single-token).
    """
    tokens = concept.split()
    head = tokens[-1] if tokens else concept
    # Multiword boost (abstract notions are often multiword/technical)
    multiword_bonus = normalize(len(tokens), 1, 4) * 0.3  # up to +0.3
    if lang == "en":
        # WordNet abstraction lineage
        abstract_score = 0.0
        synsets = wn.synsets(head, pos=wn.NOUN)
        if synsets:
            for s in synsets:
                try:
                    for path in s.hypernym_paths():
                        if any(ss.name().startswith("abstraction.n.") for ss in path):
                            abstract_score = max(abstract_score, 0.7)  # strong hint of abstraction
                except Exception:
                    pass
        # Suffix heuristic
        if head.endswith(ABSTRACT_SUFFIXES_EN):
            abstract_score = max(abstract_score, 0.6)
        return min(1.0, abstract_score + multiword_bonus)
    else:
        # IT: suffix heuristic
        abstract_score = 0.0
        if head.endswith(ABSTRACT_SUFFIXES_IT):
            abstract_score = max(abstract_score, 0.6)
        # Penalize likely proper names (single token titlecase) by reducing abstraction
        if len(tokens) == 1 and head.istitle():
            abstract_score = max(abstract_score - 0.2, 0.0)
        return min(1.0, abstract_score + multiword_bonus)

def concept_weight(concept, lang, alpha=0.5, beta=0.5):
    fd = fd_semantic_depth(concept, lang)
    fa = fa_abstraction(concept, lang)
    w = alpha * fd + beta * fa
    return max(0.0, min(1.0, w)), fd, fa

# -----------------------
# Metrics
# -----------------------
def compute_metrics(text, lang):
    nlp = get_nlp(lang)
    doc = nlp(text)
    # Total tokens excluding punct/space
    N = sum(1 for t in doc if (not t.is_space and not t.is_punct))
    concept_counts = extract_concepts(doc, lang)
    C_unique = len(concept_counts)
    total_mentions = sum(concept_counts.values()) if concept_counts else 0
    # DCg
    DCg = (C_unique / N) if N > 0 else 0.0
    # Weights
    concept_data = []
    total_weight = 0.0
    for c, freq in concept_counts.items():
        w, fd, fa = concept_weight(c, lang)
        total_weight += w
        concept_data.append({
            "concept": c,
            "freq": freq,
            "w": w,
            "fd": fd,
            "fa": fa
        })
    # DCp
    DCp = (total_weight / N) if N > 0 else 0.0
    # IRC: 1 - |C| / total_mentions
    IRC = 0.0
    if total_mentions > 0:
        IRC = 1.0 - (C_unique / total_mentions)
        IRC = max(0.0, min(1.0, IRC))
    # EI
    EI = DCp * (1.0 - IRC)
    # Sort for display
    concept_data.sort(key=lambda x: (x["w"], x["freq"]), reverse=True)
    return {
        "N": N,
        "C_unique": C_unique,
        "total_mentions": total_mentions,
        "DCg": DCg,
        "DCp": DCp,
        "IRC": IRC,
        "EI": EI,
        "concepts": concept_data
    }

# -----------------------
# GUI
# -----------------------
class ConceptometryApp:
    def __init__(self, root):
        self.root = root
        root.title("Concettometria — Analisi")
        root.geometry("1100x750")
        self.lang_var = tk.StringVar(value="it")
        # Controls frame
        top = ttk.Frame(root, padding=10)
        top.pack(side=tk.TOP, fill=tk.X)
        ttk.Label(top, text="Lingua:").pack(side=tk.LEFT)
        self.lang_combo = ttk.Combobox(top, textvariable=self.lang_var, values=["it", "en"], width=5, state="readonly")
        self.lang_combo.pack(side=tk.LEFT, padx=5)
        ttk.Button(top, text="Analizza", command=self.run_analysis).pack(side=tk.LEFT, padx=5)
        ttk.Button(top, text="Carica file…", command=self.load_file).pack(side=tk.LEFT, padx=5)
        ttk.Button(top, text="Esporta CSV", command=self.export_csv).pack(side=tk.LEFT, padx=5)
        # Text input
        self.text_area = scrolledtext.ScrolledText(root, wrap=tk.WORD, height=12, font=("Segoe UI", 11))
        self.text_area.pack(fill=tk.BOTH, expand=False, padx=10, pady=5)
        # Metrics frame
        self.metrics_frame = ttk.LabelFrame(root, text="Metriche", padding=10)
        self.metrics_frame.pack(fill=tk.X, padx=10, pady=5)
        self.metrics_vars = {
            "N": tk.StringVar(value="-"),
            "C_unique": tk.StringVar(value="-"),
            "total_mentions": tk.StringVar(value="-"),
            "DCg": tk.StringVar(value="-"),
            "DCp": tk.StringVar(value="-"),
            "IRC": tk.StringVar(value="-"),
            "EI": tk.StringVar(value="-")
        }
        grid = ttk.Frame(self.metrics_frame)
        grid.pack(fill=tk.X)
        row = 0
        for label, key in [("Token (N)", "N"),
                           ("Concetti unici (|C|)", "C_unique"),
                           ("Menzioni concetti", "total_mentions"),
                           ("DCg", "DCg"),
                           ("DCp", "DCp"),
                           ("IRC", "IRC"),
                           ("EI", "EI")]:
            ttk.Label(grid, text=label + ":").grid(row=row, column=0, sticky="w", padx=5, pady=2)
            ttk.Label(grid, textvariable=self.metrics_vars[key]).grid(row=row, column=1, sticky="w", padx=5, pady=2)
            row += 1
        # Table of concepts
        self.table_frame = ttk.LabelFrame(root, text="Concetti (Top)", padding=10)
        self.table_frame.pack(fill=tk.BOTH, expand=True, padx=10, pady=5)
        cols = ("concept", "freq", "w", "fd", "fa")
        self.tree = ttk.Treeview(self.table_frame, columns=cols, show="headings", height=10)
        headings = {
            "concept": "Concetto",
            "freq": "Frequenza",
            "w": "Peso w(c)",
            "fd": "Fd",
            "fa": "Fa"
        }
        for c in cols:
            self.tree.heading(c, text=headings[c])
            self.tree.column(c, anchor="w", width=160 if c == "concept" else 100)
        self.tree.pack(fill=tk.BOTH, expand=True)
        # Chart
        self.chart_frame = ttk.LabelFrame(root, text="Top concetti per peso", padding=10)
        self.chart_frame.pack(fill=tk.BOTH, expand=True, padx=10, pady=5)
        self.figure = plt.Figure(figsize=(7, 3), dpi=100)
        self.ax = self.figure.add_subplot(111)
        self.canvas = FigureCanvasTkAgg(self.figure, master=self.chart_frame)
        self.canvas.get_tk_widget().pack(fill=tk.BOTH, expand=True)
        # Data
        self.last_results = None

    def run_analysis(self):
        text = normalize_space(self.text_area.get("1.0", tk.END))
        if not text:
            messagebox.showwarning("Attenzione", "Inserisci del testo da analizzare.")
            return
        lang = self.lang_var.get()
        try:
            results = compute_metrics(text, lang)
        except Exception as e:
            messagebox.showerror("Errore", f"Analisi fallita:\n{e}")
            return
        self.last_results = results
        self.update_metrics(results)
        self.update_table(results)
        self.update_chart(results)

    def update_metrics(self, res):
        self.metrics_vars["N"].set(res["N"])
        self.metrics_vars["C_unique"].set(res["C_unique"])
        self.metrics_vars["total_mentions"].set(res["total_mentions"])
        self.metrics_vars["DCg"].set(f"{res['DCg']:.4f}")
        self.metrics_vars["DCp"].set(f"{res['DCp']:.4f}")
        self.metrics_vars["IRC"].set(f"{res['IRC']:.4f}")
        self.metrics_vars["EI"].set(f"{res['EI']:.4f}")

    def update_table(self, res, top_k=30):
        for row in self.tree.get_children():
            self.tree.delete(row)
        for item in res["concepts"][:top_k]:
            self.tree.insert("", "end", values=(
                item["concept"],
                item["freq"],
                f"{item['w']:.3f}",
                f"{item['fd']:.3f}",
                f"{item['fa']:.3f}"
            ))

    def update_chart(self, res, top_k=10):
        self.ax.clear()
        data = res["concepts"][:top_k]
        if not data:
            self.canvas.draw()
            return
        labels = [d["concept"] for d in data]
        weights = [d["w"] for d in data]
        bars = self.ax.barh(range(len(labels)), weights, color="#3b82f6")
        self.ax.set_yticks(range(len(labels)))
        self.ax.set_yticklabels(labels)
        self.ax.invert_yaxis()
        self.ax.set_xlabel("Peso w(c)")
        self.ax.set_xlim(0, 1)
        for i, b in enumerate(bars):
            self.ax.text(b.get_width() + 0.01, b.get_y() + b.get_height() / 2, f"{weights[i]:.2f}", va="center")
        self.figure.tight_layout()
        self.canvas.draw()

    def load_file(self):
        fp = filedialog.askopenfilename(filetypes=[("Testo", "*.txt"), ("Tutti i file", "*.*")])
        if not fp:
            return
        try:
            with open(fp, "r", encoding="utf-8") as f:
                content = f.read()
            self.text_area.delete("1.0", tk.END)
            self.text_area.insert(tk.END, content)
        except Exception as e:
            messagebox.showerror("Errore", f"Impossibile aprire il file:\n{e}")

    def export_csv(self):
        if not self.last_results:
            messagebox.showwarning("Attenzione", "Esegui prima un'analisi.")
            return
        fp = filedialog.asksaveasfilename(defaultextension=".csv", filetypes=[("CSV", "*.csv")])
        if not fp:
            return
        try:
            import csv
            with open(fp, "w", encoding="utf-8", newline="") as f:
                writer = csv.writer(f, delimiter=";")
                writer.writerow(["metric", "value"])
                for k in ["N", "C_unique", "total_mentions", "DCg", "DCp", "IRC", "EI"]:
                    writer.writerow([k, self.last_results[k]])
                writer.writerow([])
                writer.writerow(["concept", "freq", "w", "fd", "fa"])
                for item in self.last_results["concepts"]:
                    writer.writerow([item["concept"], item["freq"], f"{item['w']:.6f}", f"{item['fd']:.6f}", f"{item['fa']:.6f}"])
            messagebox.showinfo("OK", "Esportazione completata.")
        except Exception as e:
            messagebox.showerror("Errore", f"Esportazione fallita:\n{e}")

def main():
    root = tk.Tk()
    app = ConceptometryApp(root)
    root.mainloop()

if __name__ == "__main__":
    main()
How to use it in practice
- Paste the text of Borges's "The Library of Babel" into the text area.
- Select language "it" or "en" depending on the version of the text.
- Click "Analizza" to see:
  - DCg, DCp, IRC, EI.
  - The heaviest concepts (with Fd, Fa).
  - A chart of the top concepts.
Tips:
- For comparisons, analyze different sections (e.g. the first three paragraphs vs the last three).
- Export the CSV to integrate the results into your paper.
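For downstream comparisons, the exported CSV (semicolon-delimited, a metrics block followed by a blank row and the per-concept table, as written by export_csv above) can be read back programmatically. A minimal sketch; the helper name `read_export` and the sample values are illustrative:

```python
import csv

def read_export(path):
    """Split the semicolon-delimited export into a metrics dict and a concept list."""
    metrics, concepts = {}, []
    with open(path, encoding="utf-8") as f:
        rows = list(csv.reader(f, delimiter=";"))
    # The blank row written by export_csv separates metrics from the concept table
    sep = rows.index([])
    for key, value in rows[1:sep]:  # skip the "metric;value" header
        metrics[key] = float(value) if "." in value else int(value)
    for concept, freq, w, fd, fa in rows[sep + 2:]:  # skip the concept header
        concepts.append({"concept": concept, "freq": int(freq), "w": float(w)})
    return metrics, concepts
```

Reading two exports this way makes it easy to compare, say, DCp of the opening paragraphs against the closing ones.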
Limitations and possible extensions
- WordNet covers English well; for Italian we rely on heuristics. An ideal extension: Open Multilingual WordNet or BabelNet for Italian semantic depth.
- Improve Fa with a concreteness lexicon (e.g. Brysbaert) or neural models of abstraction.
- Add a sliding window over sentences/paragraphs to map the conceptual distribution along the text.
- Integrate more sophisticated keyphrase extraction (TextRank/YAKE) and word-sense disambiguation (WSD).
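The sliding-window extension can be sketched independently of the GUI. A minimal version, assuming sentences have already been reduced to (token count, concept set) pairs by the extraction pipeline; the function name `windowed_dcg` and the window parameters are illustrative, not part of the reference implementation:

```python
def windowed_dcg(sentences, window=3, step=1):
    """Raw conceptual density (DCg) over overlapping windows of sentences.

    `sentences` is a list of (token_count, concepts) pairs, where `concepts`
    is an iterable of canonicalized concept strings for that sentence.
    Returns a list of (start_index, DCg) tuples, one per window.
    """
    out = []
    for start in range(0, max(1, len(sentences) - window + 1), step):
        chunk = sentences[start:start + window]
        n_tokens = sum(n for n, _ in chunk)
        unique = set()
        for _, concepts in chunk:
            unique.update(concepts)  # |C| counts distinct concepts in the window
        dcg = len(unique) / n_tokens if n_tokens else 0.0
        out.append((start, dcg))
    return out

# Toy usage: three sentences with pre-extracted concepts
sents = [(10, {"library", "hexagon"}), (8, {"library"}), (12, {"infinity", "book"})]
print(windowed_dcg(sents, window=2))
```

Plotting the resulting series gives a density profile of the text, which is exactly the kind of map the extension above aims for.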