Toy example

If you have never used Gismo before, this tutorial is the recommended place to start.

A typical Gismo workflow is as follows:

- The input is a list of objects, called the source.
- The source is wrapped into a Corpus object.
- A dual embedding is computed that relates objects and their content.
- The embedding fuels a query-based ranking function.
- The best results of a query can be organized in a hierarchical way.

Source

[1]:
from gismo.common import toy_source_dict
toy_source_dict
[1]:
[{'title': 'First Document', 'content': 'Gizmo is a Mogwaï.'},
 {'title': 'Second Document', 'content': 'This is a sentence about Blade.'},
 {'title': 'Third Document',
  'content': 'This is another sentence about Shadoks.'},
 {'title': 'Fourth Document',
  'content': 'This very long sentence, with a lot of stuff about Star Wars inside, makes at some point a side reference to the Gremlins movie by comparing Gizmo and Yoda.'},
 {'title': 'Fifth Document',
  'content': 'In chinese folklore, a Mogwaï is a demon.'}]

Corpus

The to_text parameter tells Gismo how to turn a source object into text (str). The iterate_text method iterates over the textified objects.

[2]:
from gismo.corpus import Corpus
corpus = Corpus(source=toy_source_dict, to_text=lambda x: x['content'])
print("\n".join(corpus.iterate_text()))
Gizmo is a Mogwaï.
This is a sentence about Blade.
This is another sentence about Shadoks.
This very long sentence, with a lot of stuff about Star Wars inside, makes at some point a side reference to the Gremlins movie by comparing Gizmo and Yoda.
In chinese folklore, a Mogwaï is a demon.

Embedding

The Gismo embedding relies on sklearn’s CountVectorizer to extract features (words) from text. If no vectorizer is passed to the constructor, a default one is created, but it is good practice to build your own vectorizer for fine control over its parameters.

Note: always set dtype=float when building your vectorizer, as the default int type will break the embedding computation.

[3]:
from gismo.embedding import Embedding
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(dtype=float)
embedding = Embedding(vectorizer=vectorizer)

The fit_transform method builds the embedding. It combines the fit and transform methods:

- fit computes the vocabulary (list of features) of the corpus and their IDF weights.
- transform computes the ITF weights of the documents and the embeddings of documents and features.

[4]:
embedding.fit_transform(corpus)

After fitting a corpus, the features can be accessed through features.

[5]:
", ".join(embedding.features)
[5]:
'about, and, another, at, blade, by, chinese, comparing, demon, folklore, gizmo, gremlins, in, inside, is, long, lot, makes, mogwaï, movie, of, point, reference, sentence, shadoks, side, some, star, stuff, the, this, to, very, wars, with, yoda'

After transformation, a dual embedding is available between the embedding.n documents and the embedding.m features.

[6]:
embedding.n
[6]:
5
[7]:
embedding.m
[7]:
36

x is a stochastic csr matrix that represents documents as vectors of features.

[8]:
embedding.x
[8]:
<5x36 sparse matrix of type '<class 'numpy.float64'>'
        with 47 stored elements in Compressed Sparse Row format>

y is a stochastic csr matrix that represents features as vectors of documents.

[9]:
embedding.y
[9]:
<36x5 sparse matrix of type '<class 'numpy.float64'>'
        with 47 stored elements in Compressed Sparse Row format>
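“Stochastic” here means that each row sums to 1. The sketch below (plain scipy/sklearn l1 normalization, not Gismo’s actual IDF/ITF weighting) illustrates how a count matrix becomes row-stochastic:

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.preprocessing import normalize

counts = csr_matrix(np.array([[2.0, 1.0, 1.0],
                              [0.0, 3.0, 1.0]]))
# l1-normalize each row so that every row sums to 1 (row-stochastic).
stochastic = normalize(counts, norm="l1", axis=1)
print(stochastic.toarray())
```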

Ranking

To be able to rank documents according to a specific query, we construct a Gismo object from a corpus and an embedding.

[10]:
from gismo.gismo import Gismo
gismo = Gismo(corpus, embedding)

A query is made by using the rank method.

[11]:
gismo.rank("Gizmo")
[11]:
True

Results ordered by ranking (i.e. relevance to the query) are accessed through the get_documents_by_rank and get_features_by_rank methods. The number of returned results can be set with the parameter k.

[12]:
gismo.get_documents_by_rank(k=5)
[12]:
[{'title': 'First Document', 'content': 'Gizmo is a Mogwaï.'},
 {'title': 'Fourth Document',
  'content': 'This very long sentence, with a lot of stuff about Star Wars inside, makes at some point a side reference to the Gremlins movie by comparing Gizmo and Yoda.'},
 {'title': 'Fifth Document',
  'content': 'In chinese folklore, a Mogwaï is a demon.'},
 {'title': 'Second Document', 'content': 'This is a sentence about Blade.'},
 {'title': 'Third Document',
  'content': 'This is another sentence about Shadoks.'}]

If not specified, the number of documents is automatically estimated.

[13]:
gismo.get_documents_by_rank()
[13]:
[{'title': 'First Document', 'content': 'Gizmo is a Mogwaï.'}]

As the dataset is small here, the default estimator is very conservative. We can use target_k to tune that.

[14]:
gismo.parameters.target_k = .2
gismo.get_documents_by_rank()
[14]:
[{'title': 'First Document', 'content': 'Gizmo is a Mogwaï.'},
 {'title': 'Fourth Document',
  'content': 'This very long sentence, with a lot of stuff about Star Wars inside, makes at some point a side reference to the Gremlins movie by comparing Gizmo and Yoda.'},
 {'title': 'Fifth Document',
  'content': 'In chinese folklore, a Mogwaï is a demon.'}]
[15]:
gismo.get_features_by_rank()
[15]:
['mogwaï', 'gizmo', 'is', 'in', 'demon', 'chinese', 'folklore']

By default, outputs are lists of raw documents and features. It can be convenient to post-process them by setting post_documents_item and post_features_item. Gismo provides a few basic post-processing functions.

[16]:
from gismo.post_processing import post_documents_item_content
gismo.post_documents_item = post_documents_item_content
[17]:
gismo.get_documents_by_rank()
[17]:
['Gizmo is a Mogwaï.',
 'This very long sentence, with a lot of stuff about Star Wars inside, makes at some point a side reference to the Gremlins movie by comparing Gizmo and Yoda.',
 'In chinese folklore, a Mogwaï is a demon.']

The ranking algorithm is hosted inside gismo.diteration. Runtime parameters are managed inside gismo.parameters. One of the most important parameters is alpha \(\in [0,1]\), which controls the range of the underlying graph diffusion. Small values of alpha yield results close to the initial query, while larger values take the relationships between documents and features more into account.

[18]:
gismo.parameters.alpha = .8
gismo.rank("Gizmo")
gismo.get_documents_by_rank()
[18]:
['Gizmo is a Mogwaï.',
 'In chinese folklore, a Mogwaï is a demon.',
 'This very long sentence, with a lot of stuff about Star Wars inside, makes at some point a side reference to the Gremlins movie by comparing Gizmo and Yoda.']

Clustering

Gismo can organize the best results into a tree through the get_documents_by_cluster and get_features_by_cluster methods. It is recommended to set post-processing functions.

[19]:
from gismo.post_processing import post_documents_cluster_print, post_features_cluster_print
gismo.post_documents_cluster = post_documents_cluster_print
gismo.post_features_cluster = post_features_cluster_print
[20]:
gismo.get_documents_by_cluster(k=5)
 F: 0.05. R: 1.85. S: 0.99.
- F: 0.68. R: 1.77. S: 0.98.
-- Gizmo is a Mogwaï. (R: 1.23; S: 0.98)
-- In chinese folklore, a Mogwaï is a demon. (R: 0.27; S: 0.72)
-- This very long sentence, with a lot of stuff about Star Wars inside, makes at some point a side reference to the Gremlins movie by comparing Gizmo and Yoda. (R: 0.26; S: 0.67)
- F: 0.70. R: 0.08. S: 0.19.
-- This is a sentence about Blade. (R: 0.04; S: 0.17)
-- This is another sentence about Shadoks. (R: 0.04; S: 0.17)

Note: for each leaf (a document here), the post-processing indicates the Relevance (ranking weight) and Similarity (cosine similarity) with respect to the query. For internal nodes (clusters of documents), a Focus value indicates how similar the documents inside the cluster are.

The depth of the tree is controlled by a resolution parameter \(\in [0, 1]\). Low resolution yields a flat tree (star structure).

[21]:
gismo.get_documents_by_cluster(k=5, resolution=.01)
 F: 0.04. R: 1.85. S: 0.99.
- Gizmo is a Mogwaï. (R: 1.23; S: 0.98)
- In chinese folklore, a Mogwaï is a demon. (R: 0.27; S: 0.72)
- This very long sentence, with a lot of stuff about Star Wars inside, makes at some point a side reference to the Gremlins movie by comparing Gizmo and Yoda. (R: 0.26; S: 0.67)
- This is a sentence about Blade. (R: 0.04; S: 0.17)
- This is another sentence about Shadoks. (R: 0.04; S: 0.17)

High resolution yields, up to ties, a binary tree (dendrogram).

[22]:
gismo.get_documents_by_cluster(k=5, resolution=.9)
 F: 0.05. R: 1.85. S: 0.99.
- F: 0.58. R: 1.77. S: 0.98.
-- F: 0.69. R: 1.51. S: 0.98.
--- Gizmo is a Mogwaï. (R: 1.23; S: 0.98)
--- In chinese folklore, a Mogwaï is a demon. (R: 0.27; S: 0.72)
-- This very long sentence, with a lot of stuff about Star Wars inside, makes at some point a side reference to the Gremlins movie by comparing Gizmo and Yoda. (R: 0.26; S: 0.67)
- F: 0.70. R: 0.08. S: 0.19.
-- This is a sentence about Blade. (R: 0.04; S: 0.17)
-- This is another sentence about Shadoks. (R: 0.04; S: 0.17)

The principle is the same for features.

[23]:
gismo.get_features_by_cluster(k=8)
 F: 0.00. R: 1.23. S: 0.93.
- F: 0.08. R: 1.22. S: 0.93.
-- F: 0.99. R: 1.03. S: 0.97.
--- mogwaï (R: 0.46; S: 0.98)
--- gizmo (R: 0.44; S: 0.96)
--- is (R: 0.13; S: 0.98)
-- F: 1.00. R: 0.18. S: 0.21.
--- in (R: 0.05; S: 0.21)
--- chinese (R: 0.05; S: 0.21)
--- folklore (R: 0.05; S: 0.21)
--- demon (R: 0.05; S: 0.21)
- blade (R: 0.01; S: 0.03)