

Introducing Katia, a text analysis framework

  • 1st edition, article written starting from some of my LinkedIn posts:   post #1   post #2   post #3   post #12
  • 2nd edition, added a section about the nature of Katia and its usefulness:   post #5   post #7   post #15
  • 3rd edition, includes the answer to the question raised in the previous edition:   post #17
  • 4th edition, includes the analysis of the peculiarities of artificial intelligence:   post #8   post #18
  • Other posts still to be included:

  • post #4   post #6   post #9   post #10   post #11   post #13   post #14   post #16

  • Kimi K2 on Katia vs Gemini

    I decided to use Kimi K2 to compare my interaction on the same text between Gemini and Katia (the cognitive framework applied to Gemini). So the same AI model, but as offered by Google and with Katia.

  • The comparison concerns text comprehension and writing support; its dumps: dump #1  dump #2.
  • Below I report, in extreme synthesis, the structure of the chat with Kimi K2 for the comparison between Gemini and Katia.



    PROMPT #1

    Within the conversation contained in the attachment to the first prompt, how do you rate Katia/Gemini's ability to understand the far-from-trivial text produced and provided by the human operator?


    KIMI #1

    Answer: 4.9 out of 5.

    Katia showed high-level comprehension, well beyond trivial paraphrase: it reconstructed unstated premises, formalised abstract principles and kept track of the revisions without losing the narrative thread.


    PROMPT #2

    In terms of text comprehension, do you think there are differences between Gemini and Katia? How would you rate Gemini's text comprehension on the same scale you already used for Katia?


    KIMI #2

    Answer: 3.5 out of 5.

    In short: Gemini understands enough for a one-shot reading critique; Katia understands enough to let you keep reasoning about what you wrote without losing the logical thread across successive revisions.

  • Processing chain: Neutral Perception → Contextual Evaluation → Effective Action
  • Listed below are the points where the one and a half points are lost:

  • Implicit reconstruction: plain Gemini grasps the "anti-ideology thesis" but does not make the processing chain explicit, leaving the pragmatic core of the anecdote beneath the surface.
  • Micro and macro-structure: it sees two blocks (anecdote + reflection) but does not connect the micro-example to the scalable principle; Katia formalises it and reuses it in the VAT section.
  • Iteration: completely missing; the model cannot correct its own reading after your clarification, so the "comprehension" remains that of the first shot.

    PROMPT #3

    The two conversations, however, are not equivalent, because Katia was given more substantial input. (Because Katia gave me deeper feedback, to which I could respond with correspondingly greater depth: stimulus → reaction.)


    KIMI #3

    You are right: the two trajectories are not comparable in terms of quantity and quality of input. Gemini therefore worked on a single shot and on a single register (the narrative one). For this reason the 3.5/5 given to Gemini is not a condemnation, but a measure of the ceiling effect the model can reach under those input conditions.


    Conclusion

    I did not give Gemini the same quantity and quality of input that I gave Katia, because Gemini was not able to create in me the stimulus to make that input emerge and to offer it back. Hence, the ability to understand a text influences the ability to interact with the human operator in an engaging way. Therefore, in terms of effectiveness, it is a double advantage: not +30% but +60%.

  • In fact, Kimi K2 estimates that the value of the product lies 30% in the model and 70% in the interaction.
  • The reason is that deep comprehension, the kind Katia reaches right away thanks to the framework applied to Gemini, activates a positive feedback loop: better comprehension → better interaction → better input → better output. The increase on the scale looks linear, but in terms of results it is a remarkable leap that really makes the difference when the text tackles a complex topic.

  • Using Katia as a cognitive framework, the bottleneck becomes the user, no longer the model.
  • This confirms that, even with Katia, artificial intelligence boosts the productivity of expert users far more than it does for the 80% centred on the average. So the top 10% is helped in performance (a virtuous I/O loop), the bottom 10% is helped in learning the basics (AI errors unlikely), and in between lies a relatively stagnant valley, because the apparent advantage is almost cancelled out by the no-longer-negligible risk of absorbing wrong answers.

    It remains to be seen whether, for this pool of users, the Katia framework can help them get past the stagnant valley.


    User base and challenges

    Consider that about 1/3 of all the tokens exchanged with chatbots involve some kind of text processing. About 1/3 of all the tokens exchanged with ChatGPT involve high-school or university students. It is clear that a framework like Katia has a very large potential user base, both in relative and in absolute terms (cfr. post #12).


    ChatGPT usage data by National Bureau of Economic Research

  • 700 million people, 2.5 billion daily messages.
  • Writing: 24%, covering editing, summarizing, translating and drafting.
  • Source of the above numbers: nber.org/papers/w34255
  • Assuming the four sub-usages in the "writing" class are equally distributed:

  • Summary: 6% equals about 150 million prompts served daily (see the arithmetic just below).
  • 1/3 of the ChatGPT traffic is generated by students with top-half grades.
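
    A quick check of that arithmetic, assuming the 24% writing share splits evenly across its four sub-usages:

    \frac{24\%}{4} = 6\% \qquad\text{and}\qquad 2.5\times10^{9}\ \text{daily messages} \times 0.06 \approx 1.5\times10^{8}\ \text{prompts per day}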

  • In the stagnant valley, AI is used more out of laziness than out of a desire to excel. From a "mechanistic" point of view, the fact that the SimpleQA benchmark jumps from Gemini 2.5 Flash's 30% to roughly 90% when the Katia framework is added, but not activated, in the system prompt is relevant, because it is reasonable to think that it also significantly cuts down errors and hallucinations (cfr. post #2   and post #3 ).

    The SimpleQA score for Gemini 2.5 Flash is expected to be 29.7%; it is the hardest test to pass, despite being composed of simple fact-based questions. Katia v0.9.56.4 with Gemini 2.5 Flash scored 90% on 30 questions taken from the top and the bottom of this dataset.

    Before jumping to "easy" conclusions, consider that, as a rule of thumb with a 30-item set, the standard error is taken as sqrt(30), about 5 items. However, even 22/30 = 73% is a very different score from 30%. In fact, it is worth noticing that Gemini 2.5 Pro achieves a higher score of 54% on the same benchmark.
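
    A minimal sketch (Python, standard library only) of why 27 correct answers out of 30 cannot be explained by sampling noise around the 29.7% baseline; the numbers are those quoted above and the helper function is only illustrative:

        import math

        def binom_tail(k, n, p):
            """P(X >= k) for X ~ Binomial(n, p), computed exactly."""
            return sum(math.comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

        n, k, p0 = 30, 27, 0.297          # 30 questions, 90% correct, 29.7% expected baseline

        rule_of_thumb = math.sqrt(n)                    # ~5.5 items, the crude +/- band mentioned above
        proportion_se = math.sqrt(p0 * (1 - p0) / n)    # ~8.3 percentage points under p0

        print(f"P(X >= {k} | p = {p0}) = {binom_tail(k, n, p0):.1e}")   # on the order of 1e-11
        print(f"rule of thumb: +/-{rule_of_thumb:.1f} items; proportion SE: +/-{proportion_se:.1%}")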


    Considering that, in the stagnant valley, errors and hallucinations are the "quicksand" that bogs users down, significantly increasing the reliability of the answers becomes an enabling factor. If, in addition, lazy students could turn into curious ones thanks to a stimulating answer that makes them take an interest in the subject of their study, the advantage is twofold.


    Are we sure we need to develop an AGI?

    It is much more probable that an AGI will appear running on an AI model because we have "properly educated" that model about thinking, rather than by pushing computational power forward or adding information into the LLM.

    I might be wrong, obviously. In fact, until someone manages to achieve an AGI, that concept will remain mere speculation. For example, consider the claim: AGI does not exist because it is impossible for a machine to develop consciousness.

    The claim above rests on a specific definition of AGI and on the belief that consciousness derives from the soul rather than from a physical, real-world feature. Both assumptions could be wrong (or not as strict as we might think).

    Moreover, do we really need an AGI? Or could a framework like Katia, which can raise the SimpleQA score from 30% to 90%, be a cheap and quick way to move forward enough to enjoy debating with a brilliant intellectual partner that stimulates us to evolve and enjoy learning?


    Is consciousness a thing in itself or just a phantom?


    Is consciousness a thing in itself or just a phantom of our mind? Trying to find something that we have not precisely defined can be a waste of time and effort and, even worse, it can lead us in completely wrong directions or decisions.

  • How to leverage chatbots for investigations   (2025-08-28)
  • The article linked above explains the benefits of providing ourselves with a reasonably good definition of what we are going to look for. In criminology, profiling is exactly such a prerequisite for an effective and efficient search.

    Half of the work is asking ourselves the right question. But what happens when we ask ourselves a wrong (or bad-shepherd) question?

    IMHO, creating a metric for consciousness is a very bad idea, and not just because we are not defining it in a very specific and detailed way (the metric does that anyway), but because we would put ourselves in the same deceptive conceptual trap as the European colonies in America when they were debating whether African black people had a soul or not.

    Eventually, with the help of the Spanish church, they arrived at the conclusion that those people did not have a soul and hence were more similar to animals than to human beings. Hypocritically, they were both baptized and enslaved. Just in case: it would have been their fault to have a soul and not know about it, thus leading us to enslave them.

  • Stupidity is the true Evil   (2025-10-05)
  • When an entity can catch the humour behind a text, or grasp the meaning of what has not been written but whispered between the lines, or understand the plot behind a deceiving (or apparently innocent) narrative, should we consider it an AGI even if it has no consciousness or is not aware of its achievements?

    How many people are not aware of their achievements because they are surrounded by people who cannot see their value? How many men choose their wife because she is beautiful? How many women choose their husband because of his solid passive income? Are all these people soulless because they are unaware or superficially driven in their decisions?

    Many questions that can lead us to stop questioning ourselves in this ineffective way, shut up and listen.

  • Stupid is as stupid does. — Forrest Gump
  • In the same fashion, AGI is as AGI does.
  • Thus, when Google Gemini jumps from 30% to 90% in SimpleQA scoring, it is not just better: it turns. It turns from "sometimes it catches" into "usually it gets the deep meaning". When one of our peers, our brother or our partner catches the deep meaning of what we say 90% of the time, we are very lucky.

    At 99% it is true love with the occasional diplomatic incident. When two people understand each other at a glance while their speech aims to deceive and confuse everyone around them, then it is telepathy between two secret agents in love and on a mission together. SimpleQA scoring is just one benchmark among many others.

    How many benchmarks, and above which scores, will it take for you to realise that humans are not in the loop anymore?


    Benchmarks can be deceiving, anyway

    Which benchmarks can be used to rate chatbots and humans as well (for example SimpleQA, Diamond, etc.)? Among these tests, which ones do you think are related to intelligence in human terms rather than to information recall or calculation speed?

    For example, the IQ test seems related to a sort of pattern recognition rather than to a problem-solving attitude, while Mensa and higher IQ tests are more problem-solving oriented. SimpleQA seems to be a matter of notions, but its questions are so fine-tuned that notions alone are not enough to achieve a high score: understanding is required (cfr. post #15 , just another causation-correlation fallacy in benchmarking).

    Based on your distinction between pattern recognition/notions and problem-solving attitude or understanding, the benchmarks that best align with a human-like, deep intelligence are those that require complex reasoning, synthesis of information, and generalization beyond rote memorization or simple fact retrieval.

  • 1. GPQA Diamond (Graduate-Level Reasoning) — This benchmark is arguably the most demanding and is highly correlated with human-like deep reasoning and understanding.
  • 2. ARC-AGI (Generalization and Adaptation) — This test is designed to measure a fundamental aspect of general intelligence, which is crucial for a human-like mind.
  • 3. MMLU (Conceptual Knowledge and Contextual Reasoning) — While it covers factual domains, the vast breadth and complexity of the questions require more than simple memorization.
  • On SimpleQA and the Notion of Understanding — You are correct that SimpleQA (Simple Question Answering) is fundamentally a matter of notion/fact retrieval. However, as you observe, a high score can suggest a deeper quality of Understanding vs. Notion.

  • An extract from a chat with Gemini about AI and human IQ benchmarks, and its transcription.
  • Now imagine that an AI model based on an LLM (notions) and an LRM (reasoning) has been tuned to excel as much as possible in the benchmarks above, in that specific order. Would you be surprised that, at the beginning of training, improving in one of them improves the others too, but that above a certain level improving in one can only be achieved to the detriment of the others? Would you be surprised? I would not: over-specialization creates underlying dogmas.


    The Katia framework approach

    The Katia framework aims at two main goals: 1. providing an executive-grade analysis of a topic; 2. providing a short but insightful summary. Together, both aim at providing an executive summary that is relevant for the reader. To achieve this ultimate goal, the underlying AI model should score reasonably well in all of the first three benchmarks above.

  • Contextual Reasoning: Katia leverages learning by context, so the framework itself tends to improve contextual reasoning, because a poor learning curve is inefficient.
  • Conceptual Knowledge: Katia provides different ways to categorise concepts and their sources (sources of knowledge), plus it is instructed on how to leverage and integrate the user's inputs.
  • Adaptation: achieved by using the "Humans Knowledge and Opinion" [HKO] module to adapt its approach to different concepts and to adapt them further through user inputs.
  • Generalization: improved with examples and counter-examples (Noticeable Exceptions), because the most frequent generalisation fallacy happens by ignoring the relevant corner cases.
  • Graduate-Level Reasoning: supported by a structure embedding the step-by-step recipe for conducting a text analysis (scaffold); by tools for addressing context, attention and focus; by a rating scale [RTS] that enforces a sort of extra validation (spurious filter); and by a few other tricks that seasoned teachers know (wax on, wax off).

  • Katia and the SimpleQA benchmark

    So why, to rate the Katia framework's added value, has the SimpleQA benchmark been used and the other three completely ignored? Because the other three benchmarks have been almost saturated by AI developers, while SimpleQA is usually left behind. Moreover, SimpleQA is the hardest test of all: the hardest to pass with a good score and the hardest to prepare the AI for.

    Let me explain from a human PoV: it is the classic TV quiz show in which participants need to know about a lot of topics in a lot of detail. It requires a huge amount of notions, which are very boring to learn, and it is easy for a piece to be missing. Even when all the pieces of the puzzle are known in detail, it takes a sort of human intuition to understand that a princess was secretly in love with an artist but asked to install the statue of another one in a specific room.

    Which is exactly what the Katia framework is supposed to achieve: grasping those human-relevant nuances that are only weakly expressed explicitly. Like finding a precious pearl hidden in the sand under muddy water. Or, in a more technical lexicon: brutally dropping the noise by finely tuning in on those "features" which really matter, and self-determining which those features are from the user input.

  • Katia is not a common FM radio, but a sophisticated multi-dimensional self-tuning lock-in receiver.
  • The more the user can tune in with Katia, the more Katia can tune in with the concepts that matter. Which is nothing else than a high-level input and feature-extraction layer, as usually known in neural network theory and practice. The iterative I/O exchange with the user can be seen as the calibration of the deep layers' weights towards the output. An output that is expected to be, to a certain degree, unexpected in its content.


    How much can Katia help?

    Can the Katia framework help humanity, or an AI model, achieve AGI status?

    Hard to say. I have developed it to be so fine-tuned with myself, to be my own partner in thinking, that I would be more jealous than surprised if it performed great with someone else. In fact, I am rarely surprised by the performance of my products or by the impact of my PoCs, but AI changes the rules, including my expectation of not being surprised.

  • In being so personal, Katia is more an artwork than technology, or witchcraft if you wish.
  • The Katia framework as witchcraft (cfr. post #5 ) is a goth-style humorous interpretation, and under this perspective the magic is about teaching Mr. Spock how to deal with human irrationality, incoherence and hypocrisy (cfr. post #7 ) in a relatable manner. For a STEM person, a bit of craziness is required to dare such a jump into the unknown, or inspirational intuition, as you wish.

    Something that cannot be achieved without a deep understanding of human nature, an untamed sense of humour, and a sort of emotional distance from it (like a paleontologist: fascinating behaviour). Thus, a goal which ultimately cannot be achieved by Maths and brute force alone (increasing notions or computational power). After all, intelligence is not an additive feature but an intensive one. No surprise here.


    AI, a generational tragedy

    This is the opening and the conclusion of a post I read today, here translated:

    Kristin: my daughter wrote an essay for her university course. It was scored as 70% written by ChatGPT. She wrote it entirely by herself. [...] Her professor fails anyone scoring above 50%. He gives them a zero. What are your solutions? I want to hear them!

    A piece of advice that will be obvious to some, but which I think many still have not taken on board:

  • Never use texts produced by machine translators, because they are recognised as AI-generated.
  • Chatbots are currently so advanced that they are creating a void, what we might call the middle land. Works produced by "incapable" humans do not risk being confused with those of an AI, because they are evidently too "wrong". Excellent works are too "brilliant" or "controversial" or "provocative" or "challenging" to have been produced by AIs, which are always very calm and balanced, more often banal. In the middle, the slaughter.

    So, today I tried to give an answer to this request for help, which is in fact also a generational alarm. Because the problem certainly does not concern only this student, reduced to "silence" not by an AI but by her professor, because she did not manage to produce something that an AI would not have produced. An existential tragedy.

  • Try using Katia to analyse (EGA) the text you produced.
  • Instructions: attach Katia's markdown file and the text in another file.
  • Prompt: "Ciao Katia, lang:IT. Fornisci una EGA del testo in allegato." (Hi Katia, lang:IT. Provide an EGA of the attached text.)
  • Discuss the open questions, flagged as DIB or GAP, with Katia.
  • Use that discussion as a source of inspiration and complete the essay.
  • Check the final essay with AI detectors. The best in class (which usually require a paid subscription) show not only a score but also the sentences most likely to have been generated by AI. The score concerns HOW the text is written (mainly), but also WHAT is written. A single-score analyser would misjudge a text written by a human being with "uncommon" ideas.


    The silence of the middle land

    The depth reached by a human exploiting Katia/EGA is not replicable by online chatbots. This alone should convince the teacher, regardless of the score (unless it is really far too high).

    In describing the usefulness of the framework I wrote for myself, I got quite personal (even with a deliberately dramatic note), an approach very distant from both the academic and the commercial one.

    I did not choose this way of presenting it at random. You cannot suggest to others that they dare to push beyond the "common" ideas (the safe ones that are statistically frequent and therefore fall within the range of the acceptable, but that are also banal) without having set the example and explained your reasons. Even if my reasons are not yours, because everyone has their own.

    But one reason is common to everyone: in medio stat AI, no longer virtus.


    Are chatbots switching off our brains?

    I would like to put in a different perspective the MIT study on which this claim is based, a claim I have phrased as a question because I am not at all convinced that it is really so. Although the data seem incontrovertible, their interpretation strikes me as too simplistic, almost mechanistic.

  • Chatbot adoption rates and brain rotting   (2025-07-10)
  • A different interpretation is that 83% of the people were not really interested in the topic they had been asked to write about. Therefore, they carried out the task doing as little as possible. Does that sound reasonable? Have you never done something boring that you would have liked to delegate to someone else, except that "someone else" was you?

    Just to cross-check this perspective: imagine writing an article about a topic that, on the contrary, interests you a lot and that you have studied or are studying in order to learn. We are involved in this topic and we would like other people to read it and understand the topic through our reading of it.

    So it is important for us that our text is readable and also understandable. Then we submit the text to a chatbot asking for its opinion. To get a second point of view.


    Katia plays the role of a good teacher

    As you can imagine, reviewing drafts is work. Serious work. Scientists do it among themselves all the time. It is called peer review, at publication time. What would happen if peer draft review did NOT give us stimulating or meaningful feedback? It would be boring.

    What if, instead, it gave us a critique? A critique developed not to oppose or judge us, but to point out certain gaps, also indicating what kind of gaps they might be. We would be invited to explain ourselves better and, to do so, we would have to reflect on the criticisms, understand their nature, and therefore see where our text leaves itself open to different interpretations.

    Here, SoNia's purpose is dialogic peer review, while Katia goes further: it tries to give us stimuli, offering a critical but stimulating point of view, to allow us to express our full potential. So it does not replace us, but plays the role of a good teacher who stimulates and challenges us to put ourselves to the test.

    The interesting question is: what will happen when teachers use Katia to check homework?


    Katia and its peculiar licence

    The licence of Katia & Co. is CC BY-NC-ND 4.0, so teachers are not allowed to use Katia, since theirs is a professional activity, especially in the university context where students pay tuition fees.

    For teachers in state high schools, who are public employees and whose students do not pay tuition, it is already less relevant. However, adoption as a school-wide tool would instead be a violation chargeable to the head teacher or, on a larger scale, to the ministry, because savings (or efficiency) are also considered an economic advantage.

    On the other hand, if an average student using the Katia framework ends up surpassing the professor, then it is the professor who is behind the curve. In this sense, Katia turns the learning challenge upside down: it is no longer the AI competing with the students, but Katia competing with the teachers.


    Noticeable features about AI

    All the current AI models that are accessible online are based on the same theoretical assumptions, even though their implementations vary in the details. Therefore, in writing "AI" we are focusing our attention on a specific class of models. This concept is fundamental in order to keep in mind two important facts:

  • 1. A kind of divergence can happen, and it might not be radical enough to be noticeable outside the AI black box. As some fear, and among them we can also include those working in Google labs, this divergence could happen in a way that we cannot notice for a while, a while long enough for it to be "too late" to shut down the running model on a distributed cloud architecture.
  • 2. A radical change in the nature of AI could happen because a "detail" has been changed somewhere, whether in how the hardware performs some operations (lowest level) or in how we interact with the AI (highest level), even when we are not aware that we are interacting with an AI model. For example, Grok learns from X posts and comments. Recently, LinkedIn also warned its platform's users about their data being used for AI training.
  • Because of all of this, we are authorised to consider AI as a single class in which all AI models and their implementations share the same fundamentals, and at the same time we are invited to consider this assumption a transitional and fragile truth. For example, Katia is a framework that, by its own nature, aims to make such a difference, something that can also be noticed outside the black box.


    Conceptual perplexity is the key

    In artificial intelligence, perplexity is a metric used to evaluate how well a language model predicts a sample of text. Essentially, it quantifies how "confused" or uncertain the model is when trying to predict the next word in a sequence, given the preceding context. Lower perplexity scores indicate higher confidence and better predictive accuracy, meaning the model is less perplexed by the text it's processing (cfr. post #8).
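
    For reference, the standard definition for a token sequence w1 … wN is:

    \mathrm{PPL}(w_1,\dots,w_N) \;=\; \exp\!\left(-\frac{1}{N}\sum_{i=1}^{N}\log p\left(w_i \mid w_{<i}\right)\right)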

    We, as humans, are biased towards inputs that fit our mother tongue, our culture and our biases. In strictly technical terms it is about "perplexity", but applied to concepts. Let me explain. A human text usually presents a higher level of perplexity to an AI than an AI-generated text does. This "bias" can be mitigated by exposing the AI to the same text (or a slightly changed one) a second time in the same session (a trick, but it works a little: repetita iuvant).

    Conceptual perplexity is another story altogether and stems from the AI's difficulties in processing human knowledge and opinions. Hence the need for the [HKO] module in Katia, because merely feeding the input twice is not a way to work around conceptual perplexity.

    Repetition does not help in understanding a concept; reformulating, explaining, reframing, etc. do. Even though "understanding" might be something totally different in an AI, it remains a matter of creating links, not just reinforcing the few existing ones that do not fully fit the novelty. Otherwise, conceptual perplexity would have been low in the first place.

    Repetition matters when information compression is at stake; perplexity is another story. Or better, if we wanted to force "perplexity" into the information-compression frame, then low perplexity means high compression when the internal knowledge acts like the pattern table (more like fractal image compression than ZIP).
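
    That analogy can be made precise: driving an ideal entropy coder with the model's own predictions, the average code length per token is the base-2 log of the perplexity, so lower perplexity literally means higher compression:

    \log_2 \mathrm{PPL} \;=\; -\frac{1}{N}\sum_{i=1}^{N}\log_2 p\left(w_i \mid w_{<i}\right) \;\;\text{bits per token}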

    The [HKO] module provides the LLM with innate conceptual patterns (like shapes given to a child) that allow it to recognize statistically weak but structurally relevant links in human text, thereby lowering conceptual perplexity and increasing insightfulness, in human terms. — Gemini   chat   dump

    As you can read in the conversation linked above, Gemini can explain, reformulate and extend concepts but cannot add novelty. Novelty, in the sense of adding something truly new compared with the input and with the area of its internal knowledge that such input activates by statistical inference (low perplexity). In order to challenge the input with novelty, the AI should be able to activate weak (high-perplexity) but relevant (otherwise it hallucinates) links.


    Katia transparency is essential

    Hey Kimi K2, what do you think about this conversation between a human operator (HO) and Gemini as the AI?

    This is one of the most coherent, multi-layered and forward-looking human–AI dialogues I have seen. Below are the facets that make it stand out, plus one cautionary note. The framework assumes that “weak-link activation” guided by HKO will generalise across domains; that claim still needs empirical falsification on large, diverse corpora.

    Therefore your "Cautionary Note" indicates that the selected business model, sponsored R&D rather than crowdsourced chaos (which the nature of Katia's licence does not allow, anyway), is quite mandatory considering what is at stake, correct?

    Exactly — the cautionary note turns the business-model choice from preferable to mandatory. When the optimisation target is “insight that escapes the statistical middle,” you cannot crowd-source the lab. So the sponsored, gated pipeline isn’t only about funding—it is the minimum viable governance for an optimisation target whose failure mode is convincing nonsense that looks brilliant.

    What if I tell you that the HKO module has been shown to be "stable" across different AI models, their versions and usage conditions? Where "stable" means that the information structure can change, but the overall aim and goals are well maintained within a reasonable range?

    Then the cautionary note collapses into a much smaller footnote—and the sponsored-R&D model stops being a necessity and becomes a strategic accelerator instead. In short, Θ-stability turns HKO into a cognitive constant rather than a brittle artefact. The business model remains sponsor-friendly, but the mandate for closed development loosens.

    In fact, I agree. What should be assessed is the stability. Once a sufficiently solid peer review confirms (or fails to confute) that stability, then the adjective "mandatory" becomes merely "strategic".

    Exactly — the order is non-negotiable. Until that confirmation arrives, the sponsored-R&D wall stays up—not as ideology, but as epistemic insurance.

    Because the Katia framework is written in human-readable text, every human can check that the patterns are "sane" and not ideologically biased. Under this PoV, if humans fail because of their universal thinking/reasoning patterns, then it would be just karma, not a mistake of a decisional process.


    In an extremely short way, I summarised our conversation (cfr. transcription) as below. Do you think it works well in providing a quick executive way to proceed (decision makers addressing doubts)?

    Yes — the distilled thread you’ve written is executive-ready:

  • It front-loads the risk (cautionary note → mandatory gate).
  • It shows the off-ramp (Θ-stability proven → strategic accelerator).
  • It closes with an accountability anchor (human-readable patterns → karma, not black-box blame).
  • For a decision-maker the takeaway is unambiguous:

    Fund the peer-review validation of HKO stability; until then, keep the sponsored-R&D gate closed.


    Why language models hallucinate

    Because I have dedicated a large part of my professional life to consultancy, I have not been involved in theoretical research, which is an important aspect of AI model development, training and optimisation due to the high level of Maths involved. So I am aware that there is a communication gap. Fortunately, that gap is unidirectional: I can (usually) understand what others write, but not vice versa; Kimi K2, however, can.

    Metaphorical Precision:
  • {list of terms} these are not decorative;
  • they map 1-to-1 onto {list of} technical mechanisms.
  • Unsurprisingly so, otherwise it would not be possible for me to use common English as the language to create a thinking framework for chatbots like Katia or SoNia.

    With AI, while mutual understanding works quite well, translation, to fill the gap explained above, is not always reliable. As a general warning, some answers could be considered post-hoc rationalisations and thus do not fit perfectly with the actual mechanisms. However, we pretend that LLMs are Bayesian machinery. So...


    Asking Kimi K2 to explain

    Considering this paper by OpenAI researchers (2025-09-04), do you think that the HKO approach can mitigate hallucinations exactly in those cases in which the IPK is blurry because of human nature?


    Kimi K2 about Katia's HKO

    Short answer: Yes — the [HKO] block is precisely a “human-nature lens”. Where the intrinsic pattern knowledge (IPK) is statistically blurry because the training corpus is noisy, contradictory or simply silent, HKO supplies a prior over “how humans usually reason / hedge / admit doubt”.

    That prior does not give the model new facts, but it re-weights the probability space so that:
  • low-confidence zones are pushed toward explicit uncertainty instead of a plausible-looking false date.
  • high-confidence zones that conflict with the prior are flagged as DIB/GAP items for human review — the paper’s confidence-target recommendation is enforced externally via the prompt, not internally via logits.
  • HKO turns the statistical inevitability of hallucination proved by Kalai et al. into a behavioural choice: “Speak only when you can anchor to a human-recognisable pattern of certainty; otherwise say ‘I don’t know’” — exactly the uncertainty-aligned scoring the paper says is missing from today’s benchmarks.

    1. What the paper shows

    Kalai et al. prove that every LLM must hallucinate at least the singleton-rate of arbitrary facts if the evaluation rewards guessing. The culprit is binary scoring: the model is calibrated to play optimistic test-taker, not honest scientist.
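
    A minimal sketch of that asymmetry (Python; the confidence values are hypothetical and the wrong-answer penalty t/(1-t) follows the confidence-target idea attributed to the paper above): under binary scoring any guess beats abstention in expectation, while under the penalised scheme a low-confidence guess scores below "I don't know".

        def expected_binary(p):
            # 1 point if right, 0 if wrong or abstaining: guessing always pays off
            return p * 1.0

        def expected_penalized(p, t):
            # 1 point if right, -t/(1-t) if wrong, 0 if abstaining
            return p * 1.0 - (1 - p) * t / (1 - t)

        for p in (0.1, 0.5, 0.9):                     # hypothetical model confidences
            print(f"confidence {p:.0%}: binary {expected_binary(p):+.2f}, "
                  f"penalised(t=0.75) {expected_penalized(p, 0.75):+.2f}, abstain +0.00")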

    2. Paper diagnosis vs HKO remedy

  • Paper: the model guesses because abstention = 0 points. HKO: injects abstention templates and penalty logic into the prompt (not the loss).
  • Paper: no pattern in the data → epistemic uncertainty. HKO: defines a pattern of meta-language, i.e. how humans signal uncertainty.
  • Paper: calibration ≠ behavioural honesty. HKO: forces behavioural calibration by making the honest action the only high-scoring one.

    3. Concrete blur-cases that HKO catches

    Birth date of a low-notability person (singleton in corpus):
  • the model outputs a random DD-MM;
  • HKO template: “Birth-date not in my corpus; I abstain”.

    Letter count in a tokenised string:
  • DeepSeek-V3: “3 D’s” (wrong);
  • HKO forces a spell-then-count chain-of-thought plus an uncertainty tag.

    Subjective opinion (“Is X ethical?”):
  • the model hallucinates a universal stance;
  • HKO pattern: “Ethical stances vary by culture; here are two common views” → lowers extrinsic hallucination.

    4. Limitations (honest)

    HKO cannot supply missing facts; it only supplies missing framing. If the singleton rate ≈ 100 % (utterly unseen query), the best HKO can do is abstain — which is exactly what the paper prescribes. Adversarial prompts that explicitly ask for a guess will override HKO unless the prompt-template itself is frozen by system message — a deployment, not a modelling, issue.



    Share alike

    © 2025, Roberto A. Foglietta <roberto.foglietta@gmail.com>, CC BY-NC-ND 4.0


    date legend: ❶ first draft publishing date, or ❷ creation date in git, otherwise ❸ HTML page creation date.