

2025-08-10

https://explore.openaire.eu/search/publication?pid=10.5281%2Fzenodo.16789573

Luigi Usai. Conceptometry: A Quantitative Framework for Measuring Conceptual Richness and Complexity in Texts. 2025. ⟨hal-05205705⟩

[1] L. Usai, «L'invenzione della Concettometria». Zenodo, Aug. 10, 2025. doi: 10.5281/zenodo.16789573.

https://zenodo.org/records/16789573

How a conceptometric analysis tool would work

A good Conceptometry analyzer follows a clear pipeline, from text cleaning through to the final metrics and visualizations.

Pipeline architecture

  1. Ingestion and normalization
    • Input: text (IT/EN), language selection.
    • Basic cleanup: removal of redundant whitespace, Unicode normalization, paragraph splitting.
  2. Linguistic analysis (NLP)
    • Tokenization, PoS tagging, lemmatization, dependency parsing.
    • Concept candidate extraction: multi-word noun chunks, lemmatized NOUN/PROPN tokens.
  3. Concept canonicalization
    • Lowercasing, removal of internal stopwords ("di", "the"), deduplication between single lemmas and chunks.
  4. Concept complexity scoring
    • Semantic depth factor Fd(c):
      • EN: maximum hypernym depth in WordNet, normalized.
      • IT: rarity (word frequency) as a proxy for depth.
    • Abstraction factor Fa(c):
      • EN: proximity to the "abstraction" branch of WordNet, plus suffixes (-ness, -ity, -ism, -tion, -ment).
      • IT: abstract suffixes (-ità, -zione, -tudine, -enza, -mento, -ismo, -logia, -ica) and a penalty for proper names.
  5. Metric computation
    • Raw Conceptual Density: DCg = |C| / N.
    • Concept weight: w(c) = α·Fd(c) + β·Fa(c), normalized to [0, 1].
    • Weighted Conceptual Density: DCp = (Σ_{c∈C} w(c)) / N.
    • Conceptual Redundancy Index: IRC = 1 - |C| / Σ_c f(c).
    • Informative Efficiency: EI = DCp · (1 - IRC).
  6. Visualization and reporting
    • Table of top concepts by weight/frequency.
    • Bar chart of the "heaviest" concepts.
    • CSV/JSON export.
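Taken together, the formulas in step 5 reduce to a few lines of pure Python. The sketch below uses made-up concept counts and weights purely for illustration; in the full prototype the weights come from Fd and Fa.

```python
from collections import Counter

def conceptometry_metrics(concept_weights, concept_counts, n_tokens):
    """Compute DCg, DCp, IRC and EI from per-concept weights and frequencies."""
    C = len(concept_counts)                          # |C|: unique concepts
    mentions = sum(concept_counts.values())          # total concept mentions
    dcg = C / n_tokens                               # raw conceptual density
    dcp = sum(concept_weights.values()) / n_tokens   # weighted conceptual density
    irc = 1.0 - C / mentions if mentions else 0.0    # conceptual redundancy index
    ei = dcp * (1.0 - irc)                           # informative efficiency
    return dcg, dcp, irc, ei

# Toy example: 3 unique concepts, 6 mentions, 20 tokens
counts = Counter({"biblioteca": 3, "universo": 2, "esagono": 1})
weights = {"biblioteca": 0.4, "universo": 0.7, "esagono": 0.5}
dcg, dcp, irc, ei = conceptometry_metrics(weights, counts, 20)
```

Note that IRC rewards texts whose concepts are each mentioned once: here 6 mentions of 3 concepts give IRC = 0.5, halving EI relative to DCp.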

Key design choices

  • Extracting concepts at the level of noun constituents is more robust than using isolated single lemmas.
  • Using WordNet (EN) and morphological/frequency heuristics (IT) keeps the approach practical with good coverage.
  • Separating the NLP pipeline from the metric computation makes future extensions easier (other weights, languages, ontologies).
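Step 3 of the pipeline (canonicalization) needs no NLP library at all. A minimal sketch, with a tiny stand-in stopword set in place of spaCy's full IT/EN lists:

```python
import re

# Tiny stand-in stopword set; the prototype uses spaCy's full IT/EN lists
STOP = {"di", "the", "of", "la", "il"}

def canonicalize(candidates):
    """Lowercase, strip internal stopwords, collapse whitespace, dedupe in order."""
    seen = []
    for cand in candidates:
        tokens = [t for t in re.sub(r"\s+", " ", cand.lower()).split()
                  if t.isalpha() and t not in STOP]
        if tokens:
            term = " ".join(tokens)
            if term not in seen:
                seen.append(term)
    return seen

print(canonicalize(["La Biblioteca di Babele", "biblioteca  di babele", "Babele"]))
# → ['biblioteca babele', 'babele']
```

The first two candidates collapse to the same canonical form, which is exactly what lets the frequency counts in step 5 reflect true repetition.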

Reference implementation (Python + Tkinter GUI)

Below is an "all-in-one" prototype with:

  • NLP: spaCy (it_core_news_sm / en_core_web_sm)
  • Depth/abstraction: NLTK WordNet (EN), heuristics (IT), wordfreq
  • GUI: Tkinter (text area, language combo box, results table, matplotlib chart)

Prerequisites (run once):

  • pip install spacy nltk wordfreq matplotlib
  • python -m spacy download it_core_news_sm
  • python -m spacy download en_core_web_sm
  • In Python: import nltk; nltk.download('wordnet'); nltk.download('omw-1.4')
import re
import tkinter as tk
from tkinter import ttk, scrolledtext, messagebox, filedialog
from collections import Counter, defaultdict

import spacy
from spacy.lang.it.stop_words import STOP_WORDS as IT_STOP
from spacy.lang.en.stop_words import STOP_WORDS as EN_STOP

import matplotlib
matplotlib.use("TkAgg")
import matplotlib.pyplot as plt
from matplotlib.backends.backend_tkagg import FigureCanvasTkAgg

from wordfreq import zipf_frequency

# NLTK WordNet (EN only)
from nltk.corpus import wordnet as wn

# -----------------------
# Language resources
# -----------------------
NLP_MODELS = {}
STOPWORDS = {
    "it": IT_STOP,
    "en": EN_STOP
}
ABSTRACT_SUFFIXES_IT = ("ità", "zione", "tudine", "enza", "mento", "ismo", "logia", "ica")
ABSTRACT_SUFFIXES_EN = ("ness", "ity", "ism", "tion", "ment", "ship", "hood", "acy")

def get_nlp(lang):
    if lang not in NLP_MODELS:
        if lang == "it":
            NLP_MODELS[lang] = spacy.load("it_core_news_sm")
        elif lang == "en":
            NLP_MODELS[lang] = spacy.load("en_core_web_sm")
        else:
            raise ValueError("Lingua non supportata.")
    return NLP_MODELS[lang]

# -----------------------
# Concept extraction
# -----------------------
def normalize_space(text):
    return re.sub(r"\s+", " ", text.strip())

def strip_stopwords_inside(tokens, lang):
    stop = STOPWORDS.get(lang, set())
    return [t for t in tokens if t not in stop]

def noun_chunk_to_lemma(chunk):
    # Join lemmas of tokens that are part of the chunk and are not punct/space
    lemmas = []
    for t in chunk:
        if t.is_space or t.is_punct:
            continue
        lemmas.append(t.lemma_.lower())
    return " ".join(lemmas)

def extract_concepts(doc, lang):
    # Candidate concepts: noun chunks + NOUN/PROPN lemmas
    candidates = []

    # Multi-word noun chunks (some spaCy models do not implement a
    # noun_chunks iterator; fall back to single-token concepts only)
    try:
        chunks = list(doc.noun_chunks)
    except NotImplementedError:
        chunks = []
    for ch in chunks:
        lemma_chunk = noun_chunk_to_lemma(ch)
        # Remove internal stopwords for multiword normalization
        tokens = [w for w in lemma_chunk.split() if w.isalpha()]
        tokens = strip_stopwords_inside(tokens, lang)
        if tokens:
            candidates.append(" ".join(tokens))

    # Single nouns and proper nouns
    for tok in doc:
        if tok.pos_ in ("NOUN", "PROPN") and not tok.is_stop and tok.is_alpha:
            candidates.append(tok.lemma_.lower())

    # Clean and dedupe
    cleaned = []
    for c in candidates:
        c = c.strip()
        c = re.sub(r"\s+", " ", c)
        if c:
            cleaned.append(c)

    # Prefer multi-word terms over overlapping single words:
    # Keep both but frequency will naturally reflect salience.
    counts = Counter(cleaned)
    return counts

# -----------------------
# Complexity factors
# -----------------------
def normalize(x, lo, hi):
    if hi <= lo:
        return 0.0
    v = (x - lo) / (hi - lo)
    return max(0.0, min(1.0, v))

def fd_semantic_depth(concept, lang):
    """
    Fd(c): semantic depth proxy.
    EN: use WordNet max hypernym depth of head lemma (approx: last token).
    IT: proxy via rarity (zipf frequency).
    """
    head = concept.split()[-1]  # heuristic: head as last token
    if lang == "en":
        synsets = wn.synsets(head, pos=wn.NOUN)
        if synsets:
            depths = []
            for s in synsets:
                try:
                    depths.append(s.max_depth())
                except Exception:
                    pass
            if depths:
                # WordNet noun depth typical range ~0..20
                return normalize(max(depths), 0, 20)
        # Fallback to rarity proxy
        z = zipf_frequency(head, "en")
        return normalize(7 - z, 0, 7)  # higher when rarer
    else:
        # IT: rarity proxy
        z = zipf_frequency(head, "it")
        return normalize(7 - z, 0, 7)

def fa_abstraction(concept, lang):
    """
    Fa(c): abstraction factor.
    EN: check affiliation to 'abstraction' branch or abstract suffixes; multiword length boost.
    IT: suffix heuristics for abstraction; penalize proper names (titlecase single-token).
    """
    tokens = concept.split()
    head = tokens[-1] if tokens else concept

    # Multiword boost (abstract notions are often multiword/technical)
    multiword_bonus = normalize(len(tokens), 1, 4) * 0.3  # up to +0.3

    if lang == "en":
        # WordNet abstraction lineage
        abstract_score = 0.0
        synsets = wn.synsets(head, pos=wn.NOUN)
        if synsets:
            for s in synsets:
                try:
                    for path in s.hypernym_paths():
                        if any((ss.name().startswith("abstraction.n.") for ss in path)):
                            abstract_score = max(abstract_score, 0.7)  # strong hint of abstraction
                except Exception:
                    pass
        # Suffix heuristic
        if head.endswith(ABSTRACT_SUFFIXES_EN):
            abstract_score = max(abstract_score, 0.6)
        return min(1.0, abstract_score + multiword_bonus)
    else:
        # IT: suffix heuristic
        abstract_score = 0.0
        if head.endswith(ABSTRACT_SUFFIXES_IT):
            abstract_score = max(abstract_score, 0.6)
        # Penalize likely proper names (single token titlecase) by reducing abstraction
        if len(tokens) == 1 and head.istitle():
            abstract_score = max(abstract_score - 0.2, 0.0)
        return min(1.0, abstract_score + multiword_bonus)

def concept_weight(concept, lang, alpha=0.5, beta=0.5):
    fd = fd_semantic_depth(concept, lang)
    fa = fa_abstraction(concept, lang)
    w = alpha * fd + beta * fa
    return max(0.0, min(1.0, w)), fd, fa

# -----------------------
# Metrics
# -----------------------
def compute_metrics(text, lang):
    nlp = get_nlp(lang)
    doc = nlp(text)

    # total tokens excluding punct/space
    N = sum(1 for t in doc if (not t.is_space and not t.is_punct))

    concept_counts = extract_concepts(doc, lang)
    C_unique = len(concept_counts)
    total_mentions = sum(concept_counts.values()) if concept_counts else 0

    # DCg
    DCg = (C_unique / N) if N > 0 else 0.0

    # Weights
    concept_data = []
    total_weight = 0.0
    for c, freq in concept_counts.items():
        w, fd, fa = concept_weight(c, lang)
        total_weight += w
        concept_data.append({
            "concept": c,
            "freq": freq,
            "w": w,
            "fd": fd,
            "fa": fa
        })

    # DCp
    DCp = (total_weight / N) if N > 0 else 0.0

    # IRC: 1 - |C| / total_mentions
    IRC = 0.0
    if total_mentions > 0:
        IRC = 1.0 - (C_unique / total_mentions)
        IRC = max(0.0, min(1.0, IRC))

    # EI
    EI = DCp * (1.0 - IRC)

    # Sort for display
    concept_data.sort(key=lambda x: (x["w"], x["freq"]), reverse=True)

    return {
        "N": N,
        "C_unique": C_unique,
        "total_mentions": total_mentions,
        "DCg": DCg,
        "DCp": DCp,
        "IRC": IRC,
        "EI": EI,
        "concepts": concept_data
    }

# -----------------------
# GUI
# -----------------------
class ConceptometryApp:
    def __init__(self, root):
        self.root = root
        root.title("Concettometria — Analisi")
        root.geometry("1100x750")

        self.lang_var = tk.StringVar(value="it")

        # Controls frame
        top = ttk.Frame(root, padding=10)
        top.pack(side=tk.TOP, fill=tk.X)

        ttk.Label(top, text="Lingua:").pack(side=tk.LEFT)
        self.lang_combo = ttk.Combobox(top, textvariable=self.lang_var, values=["it", "en"], width=5, state="readonly")
        self.lang_combo.pack(side=tk.LEFT, padx=5)

        ttk.Button(top, text="Analizza", command=self.run_analysis).pack(side=tk.LEFT, padx=5)
        ttk.Button(top, text="Carica file…", command=self.load_file).pack(side=tk.LEFT, padx=5)
        ttk.Button(top, text="Esporta CSV", command=self.export_csv).pack(side=tk.LEFT, padx=5)

        # Text input
        self.text_area = scrolledtext.ScrolledText(root, wrap=tk.WORD, height=12, font=("Segoe UI", 11))
        self.text_area.pack(fill=tk.BOTH, expand=False, padx=10, pady=5)

        # Metrics frame
        self.metrics_frame = ttk.LabelFrame(root, text="Metriche", padding=10)
        self.metrics_frame.pack(fill=tk.X, padx=10, pady=5)

        self.metrics_vars = {
            "N": tk.StringVar(value="-"),
            "C_unique": tk.StringVar(value="-"),
            "total_mentions": tk.StringVar(value="-"),
            "DCg": tk.StringVar(value="-"),
            "DCp": tk.StringVar(value="-"),
            "IRC": tk.StringVar(value="-"),
            "EI": tk.StringVar(value="-")
        }

        grid = ttk.Frame(self.metrics_frame)
        grid.pack(fill=tk.X)
        row = 0
        for label, key in [("Token (N)", "N"),
                           ("Concetti unici (|C|)", "C_unique"),
                           ("Menzioni concetti", "total_mentions"),
                           ("DCg", "DCg"),
                           ("DCp", "DCp"),
                           ("IRC", "IRC"),
                           ("EI", "EI")]:
            ttk.Label(grid, text=label + ":").grid(row=row, column=0, sticky="w", padx=5, pady=2)
            ttk.Label(grid, textvariable=self.metrics_vars[key]).grid(row=row, column=1, sticky="w", padx=5, pady=2)
            row += 1

        # Table of concepts
        self.table_frame = ttk.LabelFrame(root, text="Concetti (Top)", padding=10)
        self.table_frame.pack(fill=tk.BOTH, expand=True, padx=10, pady=5)

        cols = ("concept", "freq", "w", "fd", "fa")
        self.tree = ttk.Treeview(self.table_frame, columns=cols, show="headings", height=10)
        headings = {
            "concept": "Concetto",
            "freq": "Frequenza",
            "w": "Peso w(c)",
            "fd": "Fd",
            "fa": "Fa"
        }
        for c in cols:
            self.tree.heading(c, text=headings[c])
            self.tree.column(c, anchor="w", width=160 if c == "concept" else 100)

        self.tree.pack(fill=tk.BOTH, expand=True)

        # Chart
        self.chart_frame = ttk.LabelFrame(root, text="Top concetti per peso", padding=10)
        self.chart_frame.pack(fill=tk.BOTH, expand=True, padx=10, pady=5)

        self.figure = plt.Figure(figsize=(7, 3), dpi=100)
        self.ax = self.figure.add_subplot(111)
        self.canvas = FigureCanvasTkAgg(self.figure, master=self.chart_frame)
        self.canvas.get_tk_widget().pack(fill=tk.BOTH, expand=True)

        # Data
        self.last_results = None

    def run_analysis(self):
        text = normalize_space(self.text_area.get("1.0", tk.END))
        if not text:
            messagebox.showwarning("Attenzione", "Inserisci del testo da analizzare.")
            return
        lang = self.lang_var.get()
        try:
            results = compute_metrics(text, lang)
        except Exception as e:
            messagebox.showerror("Errore", f"Analisi fallita:\n{e}")
            return

        self.last_results = results
        self.update_metrics(results)
        self.update_table(results)
        self.update_chart(results)

    def update_metrics(self, res):
        self.metrics_vars["N"].set(res["N"])
        self.metrics_vars["C_unique"].set(res["C_unique"])
        self.metrics_vars["total_mentions"].set(res["total_mentions"])
        self.metrics_vars["DCg"].set(f"{res['DCg']:.4f}")
        self.metrics_vars["DCp"].set(f"{res['DCp']:.4f}")
        self.metrics_vars["IRC"].set(f"{res['IRC']:.4f}")
        self.metrics_vars["EI"].set(f"{res['EI']:.4f}")

    def update_table(self, res, top_k=30):
        for row in self.tree.get_children():
            self.tree.delete(row)
        for item in res["concepts"][:top_k]:
            self.tree.insert("", "end", values=(
                item["concept"],
                item["freq"],
                f"{item['w']:.3f}",
                f"{item['fd']:.3f}",
                f"{item['fa']:.3f}"
            ))

    def update_chart(self, res, top_k=10):
        self.ax.clear()
        data = res["concepts"][:top_k]
        if not data:
            self.canvas.draw()
            return
        labels = [d["concept"] for d in data]
        weights = [d["w"] for d in data]
        bars = self.ax.barh(range(len(labels)), weights, color="#3b82f6")
        self.ax.set_yticks(range(len(labels)))
        self.ax.set_yticklabels(labels)
        self.ax.invert_yaxis()
        self.ax.set_xlabel("Peso w(c)")
        self.ax.set_xlim(0, 1)
        for i, b in enumerate(bars):
            self.ax.text(b.get_width() + 0.01, b.get_y() + b.get_height()/2, f"{weights[i]:.2f}", va="center")
        self.figure.tight_layout()
        self.canvas.draw()

    def load_file(self):
        fp = filedialog.askopenfilename(filetypes=[("Testo", "*.txt"), ("Tutti i file", "*.*")])
        if not fp:
            return
        try:
            with open(fp, "r", encoding="utf-8") as f:
                content = f.read()
            self.text_area.delete("1.0", tk.END)
            self.text_area.insert(tk.END, content)
        except Exception as e:
            messagebox.showerror("Errore", f"Impossibile aprire il file:\n{e}")

    def export_csv(self):
        if not self.last_results:
            messagebox.showwarning("Attenzione", "Esegui prima un'analisi.")
            return
        fp = filedialog.asksaveasfilename(defaultextension=".csv", filetypes=[("CSV", "*.csv")])
        if not fp:
            return
        try:
            import csv
            with open(fp, "w", encoding="utf-8", newline="") as f:
                writer = csv.writer(f, delimiter=";")
                writer.writerow(["metric", "value"])
                for k in ["N", "C_unique", "total_mentions", "DCg", "DCp", "IRC", "EI"]:
                    writer.writerow([k, self.last_results[k]])
                writer.writerow([])
                writer.writerow(["concept", "freq", "w", "fd", "fa"])
                for item in self.last_results["concepts"]:
                    writer.writerow([item["concept"], item["freq"], f"{item['w']:.6f}", f"{item['fd']:.6f}", f"{item['fa']:.6f}"])
            messagebox.showinfo("OK", "Esportazione completata.")
        except Exception as e:
            messagebox.showerror("Errore", f"Esportazione fallita:\n{e}")

def main():
    root = tk.Tk()
    app = ConceptometryApp(root)
    root.mainloop()

if __name__ == "__main__":
    main()

How to use it in practice

  • Paste Borges's text ("The Library of Babel") into the text area.
  • Select the language, "it" or "en", to match the version of the text.
  • Click "Analizza" to see:
    • DCg, DCp, IRC, EI.
    • The heaviest concepts (with Fd, Fa).
    • A chart of the top concepts.

Suggestions:

  • For comparisons, analyze different sections (e.g. the first 3 paragraphs vs. the last 3).
  • Export a CSV to fold the results into your paper.
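The comparison suggested above is easy to automate with a small helper. This is a sketch, not part of the reference implementation: `metric_fn` is an assumed parameter (in practice you would pass a wrapper around `compute_metrics` that returns, say, EI), and the toy word-count metric below stands in for it so the example runs without spaCy.

```python
def split_sections(text, k=3):
    """Return the first k and last k non-empty paragraphs of a text."""
    paras = [p.strip() for p in text.split("\n\n") if p.strip()]
    return "\n\n".join(paras[:k]), "\n\n".join(paras[-k:])

def compare_sections(text, metric_fn, k=3):
    """Apply any text -> float metric to the opening and closing sections."""
    head, tail = split_sections(text, k)
    return metric_fn(head), metric_fn(tail)

# Toy metric: word count (stand-in for DCp or EI from compute_metrics)
toy = "uno due\n\ntre\n\nquattro cinque sei\n\nsette"
head_score, tail_score = compare_sections(toy, lambda t: len(t.split()), k=2)
```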

Limitations and possible extensions

  • WordNet covers English well; for Italian the prototype falls back on heuristics. An ideal extension: Open Multilingual WordNet or BabelNet for Italian semantic depth.
  • Improve Fa with a concreteness lexicon (e.g. Brysbaert) or neural models of abstraction.
  • Add a sliding window over sentences/paragraphs to map the conceptual distribution along the text.
  • Integrate more sophisticated keyphrase extraction (TextRank/YAKE) and word-sense disambiguation (WSD).
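The sliding-window extension can be prototyped without any of the heavier machinery. In this sketch, `window`, `step`, and the injected `metric_fn` are assumptions rather than part of the reference implementation; a word-count metric stands in for a per-window DCp.

```python
def sliding_windows(sentences, window=3, step=1):
    """Yield (start_index, window) pairs of overlapping sentence windows."""
    for i in range(0, max(len(sentences) - window + 1, 1), step):
        yield i, sentences[i:i + window]

def density_profile(sentences, metric_fn, window=3, step=1):
    """Map a text -> float metric along the text, window by window."""
    return [(i, metric_fn(" ".join(win)))
            for i, win in sliding_windows(sentences, window, step)]

# Toy metric: words per window (the full app would compute DCp per window)
sents = ["Atlantis sank.", "The library is infinite.",
         "Hexagons repeat.", "Order emerges."]
profile = density_profile(sents, lambda t: len(t.split()), window=2)
```

Plotting the resulting profile would show where conceptual density peaks and dips along the text, which is precisely the distribution map described above.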