Reference

Gismo is made of multiple small modules designed to be mixed together.

  • corpus: This module contains simple wrappers to turn a wide range of document sources into something that Gismo will be able to process.
  • embedding: This module can create and manipulate TF-IDF/TF-ITF embeddings out of a corpus.
  • diteration: This module transforms queries into relevance vectors that can be used to rank and organize documents and features.
  • clustering: This module implements the tree-like organization of selected items.
  • gismo: The main gismo module combines all modules above to provide high level, end-to-end, analysis methods.
  • landmarks: Introduced in v0.4, this high-level module allows deeper analysis of a small corpus by using individual query results for the embedding.
  • post processing: This module provides a simple, unified way to apply automatic transformations (e.g. formatting) to the results of an analysis.
  • filesource: This module can be used to read documents one-by-one from disk instead of loading them all in memory. Useful for very large corpora.
  • sentencizer: This module can leverage a document-level gismo to provide sentence-level analysis. Can be used to extract key phrases (headlines).
  • datasets: Collection of helpers to access small (or not-so-small) datasets.
  • common: Multi-purpose module of utilities used in more than one other module.
  • parameters: Management of runtime parameters.

Corpus

class gismo.corpus.Corpus(source=None, to_text=None)[source]

The Corpus class is the starting point of any Gismo workflow. It abstracts dataset pre-processing. It is just a list of items (called documents in Gismo) augmented with a method that describes how to convert a document to a string object. It is used to build an Embedding.

Parameters:
  • source (list) – The list of items that constitutes the dataset to analyze. Actually, any iterable object with __len__() and __getitem__() methods can potentially be used as a source (see FileSource for an example).
  • to_text (function, optional) – The function that transforms an item from the source into plain text (str). If not set, it will default to the identity function lambda x: x.

Examples

The following code uses the toy_source_text list as source and specifies that the text extraction method should be: take the first 15 characters and append "...".

When we iterate with the iterate() method, observe that the extraction is not applied.

>>> corpus = Corpus(toy_source_text, to_text=lambda x: f"{x[:15]}...")
>>> for c in corpus.iterate():
...    print(c)
Gizmo is a Mogwaï.
This is a sentence about Blade.
This is another sentence about Shadoks.
This very long sentence, with a lot of stuff about Star Wars inside, makes at some point a side reference to the Gremlins movie by comparing Gizmo and Yoda.
In chinese folklore, a Mogwaï is a demon.

When we iterate with the iterate_text() method, observe that the extraction is applied.

>>> for c in corpus.iterate_text():
...    print(c)
Gizmo is a Mogw...
This is a sente...
This is another...
This very long ...
In chinese folk...

A corpus object can be saved/loaded with the dump() and load() methods inherited from the MixInIO mix-in class. The load() method is a class method to be used instead of the constructor.

>>> import tempfile
>>> corpus1 = Corpus(toy_source_text)
>>> with tempfile.TemporaryDirectory() as tmpdirname:
...    corpus1.dump(filename="myfile", path=tmpdirname)
...    corpus2 = Corpus.load(filename="myfile", path=tmpdirname)
>>> corpus2[0]
'Gizmo is a Mogwaï.'
merge_new_source(new_source, doc2key=None)[source]

Incorporate new entries while avoiding the creation of duplicates. This method is typically used when you have a dynamic source like an RSS feed and you want to periodically update your corpus.

Parameters:
  • new_source (list) – Source compatible (e.g. similar item type) with the current source.
  • doc2key (function) – Callback that provides items with unique hashable keys, used to avoid duplicates.

Examples

The following code uses the toy_source_dict list as source and adds two new items, including a redundant one.

>>> corpus = Corpus(toy_source_dict.copy(), to_text=lambda x: x['content'][:14])
>>> len(corpus)
5
>>> new_corpus = [{"title": "Another document", "content": "I don't know what to say!"},
...     {'title': 'Fifth Document', 'content': 'In chinese folklore, a Mogwaï is a demon.'}]
>>> corpus.merge_new_source(new_corpus, doc2key=lambda e: e['title'])
>>> len(corpus)
6
>>> for c in corpus.iterate_text():
...    print(c)
Gizmo is a Mog
This is a sent
This is anothe
This very long
In chinese fol
I don't know w
class gismo.corpus.CorpusList(corpus_list=None, filename=None, path='.')[source]

This class makes a list of corpora behave like one single virtual corpus. This is useful to glue together corpora with distinct shapes and to_text() methods.

Parameters:corpus_list (list of Corpus) – The list of corpora to glue.

Example

>>> multi_corp = CorpusList([Corpus(toy_source_text, lambda x: x[:15]+"..."),
...                          Corpus(toy_source_dict, lambda e: e['title'])])
>>> len(multi_corp)
10
>>> multi_corp[7]
{'title': 'Third Document', 'content': 'This is another sentence about Shadoks.'}
>>> for c in multi_corp.iterate_text():
...    print(c)
Gizmo is a Mogw...
This is a sente...
This is another...
This very long ...
In chinese folk...
First Document
Second Document
Third Document
Fourth Document
Fifth Document

Embedding

class gismo.embedding.Embedding(vectorizer=None)[source]

This class leverages the CountVectorizer class to build the dual embedding of a Corpus.

  • Documents are embedded in the space of features;
  • Features are embedded in the space of documents.

See the examples and methods below for all usages of the class.

Parameters:vectorizer (CountVectorizer, optional) – Custom CountVectorizer to override default behavior (recommended). Having a CountVectorizer adapted to the Corpus is good practice.
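For instance, one may feed the constructor a vectorizer tuned to the corpus at hand (a sketch; the stop-word and frequency settings below are purely illustrative, not library defaults):

>>> from sklearn.feature_extraction.text import CountVectorizer
>>> vectorizer = CountVectorizer(dtype=float, stop_words='english', min_df=2)
>>> embedding = Embedding(vectorizer=vectorizer)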
fit(corpus)[source]

Learn features from a corpus of documents.

  • If not yet set, a default CountVectorizer is created.
  • Features are computed and stored.
  • Inverse-Document-Frequency weights of features are computed.
Parameters:corpus (Corpus) – The corpus to ingest.

Example

>>> corpus=Corpus(toy_source_text)
>>> embedding = Embedding()
>>> embedding.fit(corpus)
>>> len(embedding.idf)
21
>>> list(embedding.features[:8])
['blade', 'chinese', 'comparing', 'demon', 'folklore', 'gizmo', 'gremlins', 'inside']
fit_ext(embedding)[source]

Use learned features from another Embedding. This is useful for the fast creation of local embeddings (e.g. at sentence level) out of a global embedding.

Parameters:embedding (Embedding) – External embedding to copy.

Examples

>>> corpus=Corpus(toy_source_text)
>>> other_embedding = Embedding()
>>> other_embedding.fit(corpus)
>>> embedding = Embedding()
>>> embedding.fit_ext(other_embedding)
>>> len(embedding.idf)
21
>>> list(embedding.features[:8])
['blade', 'chinese', 'comparing', 'demon', 'folklore', 'gizmo', 'gremlins', 'inside']
fit_transform(corpus)[source]

Ingest a corpus of documents.

  • If not yet set, a default CountVectorizer is created.
  • Features are computed and stored (fit).
  • Inverse-Document-Frequency weights of features are computed (fit).
  • TF-IDF embedding of documents is computed and stored (transform).
  • TF-ITF embedding of features is computed and stored (transform).
Parameters:corpus (Corpus) – The corpus to ingest.

Example

>>> corpus=Corpus(toy_source_text)
>>> embedding = Embedding()
>>> embedding.fit_transform(corpus)
>>> embedding.x  # doctest: +NORMALIZE_WHITESPACE
<5x21 sparse matrix of type '<class 'numpy.float64'>'
    with 25 stored elements in Compressed Sparse Row format>
>>> list(embedding.features[:8])
['blade', 'chinese', 'comparing', 'demon', 'folklore', 'gizmo', 'gremlins', 'inside']
query_projection(query)[source]

Project a query in the feature space.

Parameters:query (str) – Text to project.
Returns:
  • z (csr_matrix) – result of the query projection (IDF distribution if query does not match any feature).
  • success (bool) – projection success (True if at least one feature has been found).

Example

>>> corpus=Corpus(toy_source_text)
>>> embedding = Embedding()
>>> embedding.fit_transform(corpus)
>>> z, success = embedding.query_projection("Gizmo is not Yoda but he rocks!")
>>> for i in range(len(z.data)):
...    print(f"{embedding.features[z.indices[i]]}: {z.data[i]}") # doctest: +ELLIPSIS
gizmo: 0.3868528072...
yoda: 0.6131471927...
>>> success
True
>>> z, success = embedding.query_projection("That content does not intersect toy corpus")
>>> success
False
transform(corpus)[source]

Ingest a corpus of documents using existing features. Requires that the embedding has been fitted beforehand.

  • TF-IDF embedding of documents is computed and stored.
  • TF-ITF embedding of features is computed and stored.
Parameters:corpus (Corpus) – The corpus to ingest.

Example

>>> corpus=Corpus(toy_source_text)
>>> embedding = Embedding()
>>> embedding.fit_transform(corpus)
>>> [embedding.features[i] for i in embedding.x.indices[:8]]
['gizmo', 'mogwaï', 'blade', 'sentence', 'sentence', 'shadoks', 'comparing', 'gizmo']
>>> small_corpus = Corpus(["I only talk about Yoda", "Gizmo forever!"])
>>> embedding.transform(small_corpus)
>>> [embedding.features[i] for i in embedding.x.indices]
['yoda', 'gizmo']
gismo.embedding.auto_vect(corpus=None)[source]

Creates a default CountVectorizer compatible with the Embedding constructor. For not-too-small corpora, a slight frequency filter is applied.

Parameters:corpus (Corpus, optional) – The corpus for which the CountVectorizer is intended.
Returns:A CountVectorizer object compatible with the Embedding constructor.
Return type:CountVectorizer
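Typical usage is to let auto_vect build the vectorizer before constructing the embedding (a sketch using the toy corpus from the examples above):

>>> corpus = Corpus(toy_source_text)
>>> vectorizer = auto_vect(corpus)
>>> embedding = Embedding(vectorizer=vectorizer)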
gismo.embedding.idf_fit[source]

Computes the Inverse-Document-Frequency vector on sparse embedding y.

Parameters:
  • indptr (ndarray) – Pointers of the embedding y (e.g. y.indptr).
  • n (int) – Number of documents.
Returns:

idf_vector – IDF vector of size m.

Return type:

ndarray
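A minimal sketch of how idf_fit consumes the CSR attributes of a feature-side embedding (toy values; the exact IDF formula is internal, hence the skipped output):

>>> import numpy as np
>>> from scipy.sparse import csr_matrix
>>> y = csr_matrix(np.array([[1.0, 1.0, 0.0], [1.0, 0.0, 0.0]]))  # 2 features x 3 documents
>>> idf_fit(y.indptr, 3)  # doctest: +SKIP
array([...])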

gismo.embedding.idf_transform[source]

Applies inplace Inverse-Document-Frequency transformation on sparse embedding y.

Parameters:
  • indptr (ndarray) – Pointers of the embedding y (e.g. y.indptr).
  • data (ndarray) – Values of the embedding y (e.g. y.data).
  • idf_vector (ndarray) – IDF vector of the embedding, obtained from idf_fit().
gismo.embedding.itf_fit_transform[source]

Applies inplace Inverse-Term-Frequency transformation on sparse embedding x.

Parameters:
  • indptr (ndarray) – Pointers of the embedding (e.g. x.indptr).
  • data (ndarray) – Values of the embedding (e.g. x.data).
  • m (int) – Number of features
gismo.embedding.l1_normalize[source]

Computes L1 norm on sparse embedding (x or y) and applies inplace normalization.

Parameters:
  • indptr (ndarray) – Pointers of the embedding (e.g. x.indptr).
  • data (ndarray) – Values of the embedding (e.g. x.data).
Returns:

l1_norm – L1 norms of all vectors of the embedding before normalization.

Return type:

ndarray
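A usage sketch relying only on the documented behavior (in-place L1 normalization, norms returned); outputs are skipped as illustrative:

>>> from scipy.sparse import csr_matrix
>>> x = csr_matrix([[2.0, 2.0], [0.0, 5.0]])
>>> l1_normalize(x.indptr, x.data)  # doctest: +SKIP
array([4., 5.])
>>> x.toarray()  # doctest: +SKIP
array([[0.5, 0.5],
       [0. , 1. ]])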

gismo.embedding.query_shape[source]

Applies inplace logarithmic smoothing, IDF weighting, and normalization to the output of the CountVectorizer transform() method.

Parameters:
Returns:

norm – The norm of the vector before normalization.

Return type:

float

DIteration

class gismo.diteration.DIteration(n, m)[source]

This class is in charge of performing the DIteration algorithm.

Parameters:
  • n (int) – Number of documents.
  • m (int) – Number of features.
x_relevance

Relevance of documents.

Type:ndarray
y_relevance

Relevance of features.

Type:ndarray
x_order

Indices of documents sorted by relevance.

Type:ndarray
y_order

Indices of features sorted by relevance.

Type:ndarray
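DIteration is usually not instantiated directly: a Gismo instance builds one and drives it through its rank() method. A sketch of direct inspection (here assuming the Gismo instance exposes its engine as a diteration attribute):

>>> corpus = Corpus(toy_source_text)
>>> embedding = Embedding()
>>> embedding.fit_transform(corpus)
>>> gismo = Gismo(corpus, embedding)
>>> success = gismo.rank("gizmo")
>>> gismo.diteration.x_order[:3]  # doctest: +SKIP
array([...])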
gismo.diteration.jit_diffusion[source]

Core diffusion engine written to be compatible with `Numba <https://numba.pydata.org/>`_. This is where the DIteration algorithm is applied inline.

Parameters:
  • x_pointers (ndarray) – Pointers of the csr_matrix embedding of documents.
  • x_indices (ndarray) – Indices of the csr_matrix embedding of documents.
  • x_data (ndarray) – Data of the csr_matrix embedding of documents.
  • y_pointers (ndarray) – Pointers of the csr_matrix embedding of features.
  • y_indices (ndarray) – Indices of the csr_matrix embedding of features.
  • y_data (ndarray) – Data of the csr_matrix embedding of features.
  • z_indices (ndarray) – Indices of the csr_matrix embedding of the query projection.
  • z_data (ndarray) – Data of the csr_matrix embedding of the query_projection.
  • x_relevance (ndarray) – Placeholder for relevance of documents.
  • y_relevance (ndarray) – Placeholder for relevance of features.
  • alpha (float in range [0.0, 1.0]) – Damping factor. Controls the trade-off between closeness and centrality.
  • n_iter (int) – Number of round-trip diffusions to perform. Higher value means better precision but longer execution time.
  • offset (float in range [0.0, 1.0]) – Controls how much of the initial fluid should be deduced from the relevance.
  • x_fluid (ndarray) – Placeholder for fluid on the side of documents.
  • y_fluid (ndarray) – Placeholder for fluid on the side of features.

Clustering

class gismo.clustering.Cluster(indice=None, rank=None, vector=None)[source]

The ‘Cluster’ class is used for the internal representation of a hierarchical cluster. It stores the attributes that describe a clustering structure and provides basic cluster addition for merge operations.

Parameters:
  • indice (int) – Index of the head (main element) of the cluster.
  • rank (int) – The ranking order of a cluster.
  • vector (csr_matrix) – The vector representation of the cluster.
indice

Index of the head (main element) of the cluster.

Type:int
rank

The ranking order of a cluster.

Type:int
vector

The vector representation of the cluster.

Type:csr_matrix
intersection_vector

The vector representation of the common points of a cluster.

Type:csr_matrix (deprecated)
members

The indices of the cluster elements.

Type:list of int
focus

The consistency of the cluster (higher focus means that elements are more similar).

Type:float in range [0.0, 1.0]
children

The subclusters.

Type:list of Cluster

Examples

>>> c1 = Cluster(indice=0, rank=1, vector=csr_matrix([1.0, 0.0, 1.0]))
>>> c2 = Cluster(indice=5, rank=0, vector=csr_matrix([1.0, 1.0, 0.0]))
>>> c3 = c1+c2
>>> c3.members
[0, 5]
>>> c3.indice
5
>>> c3.vector.toarray()
array([[2., 1., 1.]])
>>> c3.intersection_vector.toarray()
array([[1., 0., 0.]])
>>> c1 == sum([c1])
True
gismo.clustering.covering_order(cluster, wide=True)[source]

Uses a hierarchical cluster to provide an ordering of the items that mixes rank and coverage.

This is done by exploring all clusters and subclusters by increasing similarity and rank (lexicographic order). Two variants are proposed:

  • Core: for each cluster, append its representative to the list if new. Central items tend to have better rank.
  • Wide: for each cluster, append its children’s representatives to the list if new. Marginal items tend to have better rank.
Parameters:
  • cluster (Cluster) – The cluster to explore.
  • wide (bool) – Use Wide (True) or Core (False) variant.
Returns:

Sorted indices of the items of the cluster.

Return type:

list of int
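A minimal sketch (reusing the toy embedding; the exact ordering depends on the clustering, hence the skipped output):

>>> corpus = Corpus(toy_source_text)
>>> embedding = Embedding(vectorizer=CountVectorizer(dtype=float))
>>> embedding.fit_transform(corpus)
>>> cluster = subspace_clusterize(embedding.x)
>>> covering_order(cluster, wide=True)  # doctest: +SKIP
[...]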

gismo.clustering.get_sim(csr, arr)[source]

Simple similarity computation between a csr_matrix and an ndarray.

Parameters:
  • csr (csr_matrix) – The sparse vector.
  • arr (ndarray) – The dense vector.
Returns:

The similarity between the two vectors.

Return type:

float
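A call sketch (the exact similarity definition is internal, so no output is asserted):

>>> import numpy as np
>>> from scipy.sparse import csr_matrix
>>> sim = get_sim(csr_matrix([[1.0, 0.0, 1.0]]), np.array([1.0, 1.0, 0.0]))  # doctest: +SKIP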

gismo.clustering.merge_clusters(cluster_list: list, focus=1.0)[source]

Complete merge operation. In addition to the basic merge provided by Cluster, it ensures the following:

  • Consistency of focus by integrating the extra-focus (typically given by subspace_partition()).
  • Children (the members of the list) are sorted according to their respective rank.
Parameters:
  • cluster_list (list of Cluster) – The clusters to merge into one cluster.
  • focus (float) – Evaluation of the focus (similarity) between clusters.
Returns:

The cluster merging the list.

Return type:

Cluster
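A sketch reusing the clusters from the Cluster example (children ordering follows the documented rank sorting; output skipped as illustrative):

>>> c1 = Cluster(indice=0, rank=1, vector=csr_matrix([1.0, 0.0, 1.0]))
>>> c2 = Cluster(indice=5, rank=0, vector=csr_matrix([1.0, 1.0, 0.0]))
>>> c = merge_clusters([c1, c2], focus=0.9)
>>> [child.indice for child in c.children]  # doctest: +SKIP
[5, 0]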

gismo.clustering.rec_clusterize(cluster_list: list, resolution=0.7)[source]

Auxiliary recursive function for clustering.

Parameters:
  • cluster_list (list of Cluster) – Current aggregation state.
  • resolution (float in range [0.0, 1.0]) – Sets the laziness of aggregation. A resolution set to 0.0 yields a one-step clustering (star structure), while a resolution set to 1.0 yields, up to tie similarities, a binary tree (dendrogram).
Returns:

Return type:

list of Cluster

gismo.clustering.subspace_clusterize(subspace, resolution=0.7, indices=None)[source]

Converts a subspace (matrix seen as a list of vectors) to a Cluster object (hierarchical clustering).

Parameters:
  • subspace (ndarray, csr_matrix) – A k x m matrix seen as a list of k m-dimensional vectors sorted by importance order.
  • resolution (float in range [0.0, 1.0]) – Sets the laziness of aggregation. A resolution set to 0.0 yields a one-step clustering (star structure), while a resolution set to 1.0 yields, up to tie similarities, a binary tree (dendrogram).
  • indices (list, optional) – Indicates the index for each element of the subspace. Used when ‘subspace’ is extracted from a larger space (e.g. X or Y). If not set, indices are set to range(k).
Returns:

A cluster whose leaves are the k vectors from ‘subspace’.

Return type:

Cluster

Example

>>> corpus = Corpus(toy_source_text)
>>> vectorizer = CountVectorizer(dtype=float)
>>> embedding = Embedding(vectorizer=vectorizer)
>>> embedding.fit_transform(corpus)
>>> subspace = embedding.x[1:, :]
>>> cluster = subspace_clusterize(subspace)
>>> len(cluster.children)
2
>>> cluster = subspace_clusterize(subspace, resolution=.02)
>>> len(cluster.children)
4
gismo.clustering.subspace_distortion[source]

Apply inplace distortion of a subspace with relevance.

Parameters:
  • indices (ndarray) – The indices attribute of the subspace csr_matrix.
  • data (ndarray) – Data attribute of the subspace csr_matrix.
  • relevance (ndarray) – Relevance values in the embedding space.
  • distortion (float in [0.0, 1.0]) – Power applied to relevance for distortion.
gismo.clustering.subspace_partition(subspace, resolution=0.7)[source]

Proposes a partition of the subspace that merges together vectors with a similar direction.

Parameters:
  • subspace (ndarray, csr_matrix) – A k x m matrix seen as a list of k m-dimensional vectors sorted by importance order.
  • resolution (float in range [0.0, 1.0]) – How strict the merging should be. 0.0 will merge all items together, while 1.0 will only merge mutually closest items.
Returns:

A list of subsets that form a partition. Each subset is represented by a pair (p, f). p is the set of indices of the subset, f is the typical similarity of the partition (called focus).

Return type:

list
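A minimal sketch on the same subspace as in the subspace_clusterize() example (the partition content is illustrative, hence the skipped output):

>>> corpus = Corpus(toy_source_text)
>>> embedding = Embedding(vectorizer=CountVectorizer(dtype=float))
>>> embedding.fit_transform(corpus)
>>> partition = subspace_partition(embedding.x[1:, :], resolution=0.7)
>>> [sorted(p) for p, f in partition]  # doctest: +SKIP
[...]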

Gismo

class gismo.gismo.Gismo(corpus=None, embedding=None, **kwargs)[source]

Gismo mixes a corpus and its embedding to provide search and structure methods.

Parameters:
  • corpus (Corpus) – Defines the documents of the gismo.
  • embedding (Embedding) – Defines the embedding of the gismo.
  • kwargs (dict) – Custom default runtime parameters. You just need to specify the parameters that differ from DEFAULT_PARAMETERS.

Example

The Corpus class defines how documents of a source should be converted to plain text.

>>> corpus = Corpus(toy_source_dict, lambda x: x['content'])

The Embedding class extracts features (e.g. words) and computes weights between documents and features.

>>> vectorizer = CountVectorizer(dtype=float)
>>> embedding = Embedding(vectorizer=vectorizer)
>>> embedding.fit_transform(corpus)
>>> embedding.m # number of features
36

The Gismo class combines them for performing queries. After a query is performed, one can ask for the best items. The number of items to return can be specified with parameter k or automatically adjusted.

>>> gismo = Gismo(corpus, embedding)
>>> success = gismo.rank("Gizmo")
>>> gismo.parameters.target_k = .2 # The toy dataset is very small, so we lower the target_k parameter.
>>> gismo.get_documents_by_rank()
[{'title': 'First Document', 'content': 'Gizmo is a Mogwaï.'}, {'title': 'Fourth Document', 'content': 'This very long sentence, with a lot of stuff about Star Wars inside, makes at some point a side reference to the Gremlins movie by comparing Gizmo and Yoda.'}, {'title': 'Fifth Document', 'content': 'In chinese folklore, a Mogwaï is a demon.'}]

Post processing functions can be used to tweak the returned object (the underlying ranking is unchanged).

>>> gismo.post_documents_item = partial(post_documents_item_content, max_size=44)
>>> gismo.get_documents_by_rank()
['Gizmo is a Mogwaï.', 'This very long sentence, with a lot of stuff', 'In chinese folklore, a Mogwaï is a demon.']

Ranking also works on features.

>>> gismo.get_features_by_rank()
['mogwaï', 'gizmo', 'is', 'in', 'demon', 'chinese', 'folklore']

Clustering organizes results and can provide additional hints on their relationships.

>>> gismo.post_documents_cluster = post_documents_cluster_print
>>> gismo.get_documents_by_cluster(resolution=.9) # doctest: +NORMALIZE_WHITESPACE
 F: 0.60. R: 0.65. S: 0.98.
- F: 0.71. R: 0.57. S: 0.98.
-- Gizmo is a Mogwaï. (R: 0.54; S: 0.99)
-- In chinese folklore, a Mogwaï is a demon. (R: 0.04; S: 0.71)
- This very long sentence, with a lot of stuff (R: 0.08; S: 0.69)
>>> gismo.post_features_cluster = post_features_cluster_print
>>> gismo.get_features_by_cluster() # doctest: +NORMALIZE_WHITESPACE
 F: 0.03. R: 0.29. S: 0.98.
- F: 1.00. R: 0.27. S: 0.99.
-- mogwaï (R: 0.12; S: 0.99)
-- gizmo (R: 0.12; S: 0.99)
-- is (R: 0.03; S: 0.99)
- F: 1.00. R: 0.02. S: 0.07.
-- in (R: 0.00; S: 0.07)
-- demon (R: 0.00; S: 0.07)
-- chinese (R: 0.00; S: 0.07)
-- folklore (R: 0.00; S: 0.07)

As an alternative to a textual query, the rank() method can directly use a vector z as input.

>>> z, s = gismo.embedding.query_projection("gizmo chinese folklore")
>>> z # doctest: +NORMALIZE_WHITESPACE
<1x36 sparse matrix of type '<class 'numpy.float64'>'
    with 3 stored elements in Compressed Sparse Row format>
>>> s = gismo.rank(z=z)
>>> s
True
>>> gismo.get_documents_by_rank(k=2)
['In chinese folklore, a Mogwaï is a demon.', 'Gizmo is a Mogwaï.']
>>> gismo.get_features_by_rank()
['mogwaï', 'in', 'chinese', 'folklore', 'demon', 'gizmo', 'is']

The class also offers get_documents_by_coverage() and get_features_by_coverage() that yield a list of results obtained from a Covering-like traversal of the ranked cluster.

To demonstrate it, we first add an outsider document to the corpus and rebuild Gismo.

>>> new_entry = {'title': 'Minority Report', 'content': 'Totally unrelated stuff.'}
>>> corpus = Corpus(toy_source_dict+[new_entry], lambda x: x['content'])
>>> vectorizer = CountVectorizer(dtype=float)
>>> embedding = Embedding(vectorizer=vectorizer)
>>> embedding.fit_transform(corpus)
>>> gismo = Gismo(corpus, embedding)
>>> gismo.post_documents_item = post_documents_item_content
>>> success = gismo.rank("Gizmo")
>>> gismo.parameters.target_k = .3

Recall the classical rank-based result.

>>> gismo.get_documents_by_rank()
['Gizmo is a Mogwaï.', 'This very long sentence, with a lot of stuff about Star Wars inside, makes at some point a side reference to the Gremlins movie by comparing Gizmo and Yoda.', 'In chinese folklore, a Mogwaï is a demon.']

Gismo can use the cluster to propose alternate results that try to cover more subjects.

>>> gismo.get_documents_by_coverage()
['Gizmo is a Mogwaï.', 'Totally unrelated stuff.', 'This is a sentence about Blade.']

Note how the new entry, which has nothing to do with the rest, is pushed into the results. By setting the wide option to False, we get an alternative that focuses on mainstream results.

>>> gismo.get_documents_by_coverage(wide=False)
['Gizmo is a Mogwaï.', 'This is a sentence about Blade.', 'This very long sentence, with a lot of stuff about Star Wars inside, makes at some point a side reference to the Gremlins movie by comparing Gizmo and Yoda.']

The same principle applies for features.

>>> gismo.get_features_by_rank()
['mogwaï', 'gizmo', 'is', 'in', 'chinese', 'folklore', 'demon']
>>> gismo.get_features_by_coverage()
['mogwaï', 'this', 'in', 'by', 'gizmo', 'is', 'chinese']
get_documents_by_cluster(k=None, **kwargs)[source]

Returns a cluster of the best ranked documents. The cluster is by default post_processed through the post_documents_cluster method.

Parameters:
  • k (int, optional) – Number of documents to output. If not set, k is automatically computed using the max_k and target_k runtime parameters.
  • kwargs (dict, optional) – Custom runtime parameters.
Returns:

Return type:

object

get_documents_by_cluster_from_indices(indices, **kwargs)[source]

Returns a cluster of documents. The cluster is by default post_processed through the post_documents_cluster method.

Parameters:
  • indices (list of int) – The indices of documents to be processed. It is assumed that the documents are sorted by importance.
  • kwargs (dict, optional) – Custom runtime parameters.
Returns:

Return type:

object

get_documents_by_coverage(k=None, **kwargs)[source]

Returns a list of top covering documents. By default, the documents are post_processed through the post_documents_item method.

Parameters:
  • k (int, optional) – Number of documents to output. If not set, k is automatically computed using the max_k and target_k runtime parameters.
  • kwargs (dict, optional) – Custom runtime parameters.
Returns:

Return type:

list

get_documents_by_rank(k=None, **kwargs)[source]

Returns a list of top documents according to the current ranking. By default, the documents are post_processed through the post_documents_item method.

Parameters:
  • k (int, optional) – Number of documents to output. If not set, k is automatically computed using the max_k and target_k runtime parameters.
  • kwargs (dict, optional) – Custom runtime parameters.
Returns:

Return type:

list

get_features_by_cluster(k=None, **kwargs)[source]

Returns a cluster of the best ranked features. The cluster is by default post_processed through the post_features_cluster method.

Parameters:
  • k (int, optional) – Number of features to output. If not set, k is automatically computed using the max_k and target_k runtime parameters.
  • kwargs (dict, optional) – Custom runtime parameters.
Returns:

Return type:

object

get_features_by_cluster_from_indices(indices, **kwargs)[source]

Returns a cluster of features. The cluster is by default post_processed through the post_features_cluster method.

Parameters:
  • indices (list of int) – The indices of features to be processed. It is assumed that the features are sorted by importance.
  • kwargs (dict, optional) – Custom runtime parameters
Returns:

Return type:

object

get_features_by_coverage(k=None, **kwargs)[source]

Returns a list of top covering features. By default, the features are post_processed through the post_features_item method.

Parameters:
  • k (int, optional) – Number of features to output. If not set, k is automatically computed using the max_k and target_k runtime parameters.
  • kwargs (dict, optional) – Custom runtime parameters.
Returns:

Return type:

list

get_features_by_rank(k=None, **kwargs)[source]

Returns a list of top features according to the current ranking. By default, the features are post_processed through the post_features_item method.

Parameters:
  • k (int, optional) – Number of features to output. If not set, k is automatically computed using the max_k and target_k runtime parameters.
  • kwargs (dict, optional) – Custom runtime parameters.
Returns:

Return type:

list

rank(query='', z=None, **kwargs)[source]

Runs the DIteration algorithm using the query as starting point.

Parameters:
  • query (str) – Text that starts DIteration
  • z (csr_matrix, optional) – Query vector to use in place of the textual query
  • kwargs (dict, optional) – Custom runtime parameters.
Returns:

success – success of the query projection. If projection fails, a ranking on uniform distribution is performed.

Return type:

bool

class gismo.gismo.XGismo(x_embedding=None, y_embedding=None, filename=None, path='.', **kwargs)[source]

Given two distinct embeddings based on the same set of documents, builds a new gismo. The features of x_embedding are the corpus of this new gismo. The features of y_embedding are the features of this new gismo. The dual embedding of the new gismo is obtained by crossing the two input dual embeddings.

xgismo behaves essentially like a gismo object. The main difference is an additional parameter y for the rank method, to control whether the query projection should be performed on the y_embedding or on the x_embedding.

Parameters:
  • x_embedding (Embedding) – The left embedding, which defines the documents of the xgismo.
  • y_embedding (Embedding) – The right embedding, which defines the features of the xgismo.
  • filename (str, optional) – If set, will load xgismo from file.
  • path (str or Path, optional) – Directory where the xgismo is to be loaded from.
  • kwargs (dict) – Custom default runtime parameters. You just need to specify the parameters that differ from DEFAULT_PARAMETERS.

Examples

One of the main use cases for XGismo consists in transforming a list of articles into a Gismo that relates authors and the words they use. Let’s start by retrieving a few articles.

>>> toy_url = "https://dblp.org/pers/xx/m/Mathieu:Fabien.xml"
>>> source = [a for a in url2source(toy_url) if int(a['year'])<2023]

Then we build the embedding of words.

>>> corpus = Corpus(source, to_text=lambda x: x['title'])
>>> w_count = CountVectorizer(dtype=float, stop_words='english')
>>> w_embedding = Embedding(w_count)
>>> w_embedding.fit_transform(corpus)

And the embedding of authors.

>>> to_authors_text = lambda dic: " ".join([a.replace(' ', '_') for a in dic['authors']])
>>> corpus.to_text = to_authors_text
>>> a_count = CountVectorizer(dtype=float, preprocessor=lambda x:x, tokenizer=lambda x: x.split(' '))
>>> a_embedding = Embedding(a_count)
>>> a_embedding.fit_transform(corpus)

We can now combine the two embeddings in one xgismo.

>>> xgismo = XGismo(a_embedding, w_embedding)
>>> xgismo.post_documents_item = lambda g, i: g.corpus[i].replace('_', ' ')

We can use xgismo to query keyword(s).

>>> success = xgismo.rank("Pagerank")
>>> xgismo.get_documents_by_rank()
['Mohamed Bouklit', 'Dohy Hong', 'The Dang Huynh']

We can use it to query researcher(s).

>>> success = xgismo.rank("Anne_Bouillard", y=False)
>>> xgismo.get_documents_by_rank()
['Anne Bouillard', 'Elie de Panafieu', 'Céline Comte', 'Thomas Deiß', 'Philippe Sehier', 'Dmitry Lebedev']
rank(query='', y=True, **kwargs)[source]

Runs the DIteration using query as starting point. query can be evaluated on features (y=True) or documents (y=False).

Parameters:
  • query (str) – Text that starts DIteration
  • y (bool) – Determines if query should be evaluated on features (True) or documents (False).
  • kwargs (dict, optional) – Custom runtime parameters.
Returns:

success – success of the query projection. If projection fails, a ranking on uniform distribution is performed.

Return type:

bool

Landmarks

class gismo.landmarks.Landmarks(source=None, to_text=None, **kwargs)[source]

The Landmarks class is a subclass of Corpus. It offers the capability to batch-rank all its entries against a Gismo instance. After it has been processed, a Landmarks instance can be used to analyze/classify Gismo queries, Clusters, or other Landmarks.

Landmarks also offers the possibility to reduce a source or a gismo to its neighborhood. This can be useful if the source is huge and one wants something smaller for performance.

Parameters:
  • source (list) – The list of items that form Landmarks.
  • to_text (function) – The function that transforms an item into text
  • kwargs (dict) – Custom default runtime parameters. You just need to specify the parameters that differ from DEFAULT_LANDMARKS_PARAMETERS.

Examples

Landmarks lean on a Gismo. We can use a toy Gismo to start with.

>>> corpus = Corpus(toy_source_text)
>>> embedding = Embedding()
>>> embedding.fit_transform(corpus)
>>> gismo = Gismo(corpus, embedding)
>>> print(toy_source_text) # doctest: +NORMALIZE_WHITESPACE
['Gizmo is a Mogwaï.',
'This is a sentence about Blade.',
'This is another sentence about Shadoks.',
'This very long sentence, with a lot of stuff about Star Wars inside, makes at some point a side reference to the Gremlins movie by comparing Gizmo and Yoda.',
'In chinese folklore, a Mogwaï is a demon.']

Landmarks are constructed exactly like a Gismo object, with a source and a to_text function.

>>> landmarks_source = [{'name': 'Movies', 'content': 'Star Wars, Gremlins, and Blade are movies.'},
... {'name': 'Gremlins', 'content': 'The Gremlins movie features a Mogwai.'},
... {'name': 'Star Wars', 'content': 'The Star Wars movies feature Yoda.'},
... {'name': 'Shadoks', 'content': 'Shadoks is a French sarcastic show.'},]
>>> landmarks = Landmarks(landmarks_source, to_text=lambda e: e['content'])

The fit() method computes gismo queries for all landmarks and retains the results.

>>> landmarks.fit(gismo)

We run the request Yoda and look at the key landmarks. Note that Gremlins comes before Star Wars. This is actually correct in this small dataset: the word Yoda only exists in one sentence, which contains the words Gremlins and Gizmo.

>>> success = gismo.rank('yoda')
>>> landmarks.get_landmarks_by_rank(gismo) # doctest: +NORMALIZE_WHITESPACE
[{'name': 'Gremlins', 'content': 'The Gremlins movie features a Mogwai.'},
{'name': 'Star Wars', 'content': 'The Star Wars movies feature Yoda.'},
{'name': 'Movies', 'content': 'Star Wars, Gremlins, and Blade are movies.'}]

For better readability, we set the item post-processing to return the name of a landmark item.

>>> landmarks.post_item = lambda lmk, i: lmk[i]['name']
>>> landmarks.get_landmarks_by_rank(gismo)
['Gremlins', 'Star Wars', 'Movies']

The balance parameter adjusts the trade-off between the document and feature spaces. A balance set to 1.0 focuses only on documents.

>>> success = gismo.rank('blade')
>>> landmarks.get_landmarks_by_rank(gismo, balance=1)
['Movies']

A balance set to 0.0 focuses only on features. For blade, this triggers Shadoks as a secondary result, because of the shared word sentence.

>>> landmarks.get_landmarks_by_rank(gismo, balance=0)
['Movies', 'Shadoks']

Landmarks can be used to analyze landmarks.

>>> landmarks.get_landmarks_by_rank(landmarks)
['Gremlins', 'Star Wars']

See again how balance can change things. Here a balance set to 0.0 (using only features) fully changes the results.

>>> landmarks.get_landmarks_by_rank(landmarks, balance=0)
['Shadoks']

Like for Gismo, landmarks can provide clusters.

>>> success = gismo.rank('gizmo')
>>> landmarks.get_landmarks_by_cluster(gismo) # doctest: +NORMALIZE_WHITESPACE +ELLIPSIS
{'landmark': {'name': 'Gremlins',
              'content': 'The Gremlins movie features a Mogwai.'},
              'focus': 0.999998...,
              'children': [{'landmark': {'name': 'Gremlins',
                                         'content': 'The Gremlins movie features a Mogwai.'},
                                         'focus': 1.0, 'children': []},
                           {'landmark': {'name': 'Star Wars',
                                         'content': 'The Star Wars movies feature Yoda.'},
                                         'focus': 1.0, 'children': []},
                           {'landmark': {'name': 'Movies',
                                         'content': 'Star Wars, Gremlins, and Blade are movies.'},
                                         'focus': 1.0, 'children': []}]}

We can set the post_cluster attribute to customize the output. Gismo provides a simple display.

>>> from gismo.post_processing import post_landmarks_cluster_print
>>> landmarks.post_cluster = post_landmarks_cluster_print
>>> landmarks.get_landmarks_by_cluster(gismo) # doctest: +NORMALIZE_WHITESPACE
F: 1.00.
- Gremlins
- Star Wars
- Movies

Like for Gismo, parameters like k, distortion, or resolution can be used.

>>> landmarks.get_landmarks_by_cluster(gismo, k=4, distortion=False, resolution=.9) # doctest: +NORMALIZE_WHITESPACE
F: 0.03.
- F: 0.93.
-- F: 1.00.
--- Gremlins
--- Star Wars
-- Movies
- Shadoks

Note that a Cluster can also be used as reference for the get_landmarks_by_rank() and get_landmarks_by_cluster() methods.

>>> cluster = landmarks.get_landmarks_by_cluster(gismo, post=False)
>>> landmarks.get_landmarks_by_rank(cluster)
['Gremlins', 'Star Wars', 'Movies']

Yet, not any object can be used as a reference. For example, you cannot use a string as such.

>>> landmarks.get_landmarks_by_rank("Landmarks do not use external queries (pass them to a gismo)")  # doctest: +ELLIPSIS
Traceback (most recent call last):
...
TypeError: bad operand type for unary -: 'NoneType'

Last but not least, landmarks can be used to reduce the size of a source or a Gismo. The reduction is controlled by the x_density parameter, which sets the number of documents each landmark is allowed to keep.

>>> landmarks.parameters.x_density = 1
>>> reduced_gismo = landmarks.get_reduced_gismo(gismo)
>>> reduced_gismo.corpus.source # doctest: +NORMALIZE_WHITESPACE
['This is another sentence about Shadoks.',
'This very long sentence, with a lot of stuff about Star Wars inside, makes at some point a side reference to the
Gremlins movie by comparing Gizmo and Yoda.']

Side remark #1: in the constructor, to_text indicates how to convert an item to str, while the rank parameter specifies how to run a query on a Gismo. Yet, it is possible to have the text conversion handled by the ranking function.

>>> landmarks = Landmarks(landmarks_source, rank=lambda g, q: g.rank(q['content']))
>>> landmarks.fit(gismo)
>>> success = gismo.rank('yoda')
>>> landmarks.post_item = lambda lmk, i: lmk[i]['name']
>>> landmarks.get_landmarks_by_rank(gismo)
['Star Wars', 'Movies', 'Gremlins']

However, this is bad practice. When you only need to customize the way an item is converted to text, you should stick to to_text. The rank parameter is for more elaborate behaviors that require changing the default way gismo runs queries.

Side remark #2: if a landmark item query fails (its text does not intersect the gismo features), the default uniform projection will be used and a warning will be issued. This may yield undesired results.

>>> landmarks_source.append({'name': 'unrelated', 'content': 'unrelated.'})
>>> landmarks = Landmarks(landmarks_source, to_text=lambda e: e['content'])
>>> landmarks.fit(gismo)
>>> success = gismo.rank('gizmo')
>>> landmarks.post_item = lambda lmk, i: lmk[i]['name']
>>> landmarks.get_landmarks_by_rank(gismo)
['Shadoks', 'unrelated']
fit(gismo, **kwargs)[source]

Runs gismo queries on all landmarks. The relevance results are used to build two sets of vectors: x_vectors lives in the document space; y_vectors lives in the feature space. On each space, vectors are summed to build a direction, which is a sort of vector summary of the landmarks.

Parameters:
  • gismo (Gismo) – The Gismo on which vectors will be computed.
  • kwargs (dict) – Custom Landmarks runtime parameters.
Returns:

Return type:

None

gismo.landmarks.get_direction(reference, balance)[source]

Converts a reference object into an (n+m)-dimensional direction (dense or sparse depending on the reference type).

Parameters:
  • reference (Gismo or Landmarks or Cluster or np.ndarray or csr_matrix.) – The object from which a direction will be extracted.
  • balance (float in range [0.0, 1.0]) – The trade-off between documents and features. Set to 0.0, only the feature space will be used. Set to 1.0, only the document space will be used.
Returns:

An (n+m)-dimensional direction.

Return type:

np.ndarray or csr_matrix
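For instance, after a gismo.rank() call, one may extract the blended direction that Landmarks uses internally (a sketch; gismo as built in the examples above):

>>> direction = get_direction(gismo, balance=0.5)  # doctest: +SKIP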

Post Processing

gismo.post_processing.post_documents_cluster_json(gismo, cluster)[source]

Converts a cluster of documents into basic JSON.

Parameters:
  • gismo (Gismo) – Gismo instance
  • cluster (Cluster) – Cluster of documents
Returns:

dictionary with keys ‘document’, ‘focus’, and recursive ‘children’

Return type:

dict

gismo.post_processing.post_documents_cluster_print(gismo, cluster, post_item=None, depth='')[source]

Print an ASCII view of a document cluster with metrics (focus, relevance, similarity)

Parameters:
  • gismo (Gismo) – Gismo instance
  • cluster (Cluster) – Cluster of documents
  • post_item (function, optional) – Post-processing function for individual documents
  • depth (str, optional) – Current depth string used in recursion
gismo.post_processing.post_documents_item_content(gismo, i, max_size=None)[source]

Document indice to document content.

Assumes that document has a ‘content’ key.

Parameters:
  • gismo (Gismo) – Gismo instance
  • i (int) – document indice
  • max_size (int, optional) – Maximum number of chars to return
Returns:

Content of document i from corpus

Return type:

str

gismo.post_processing.post_documents_item_raw(gismo, i)[source]

Document indice to document entry

Parameters:
  • gismo (Gismo) – Gismo instance
  • i (int) – document indice
Returns:

Document i from corpus

Return type:

object

gismo.post_processing.post_features_cluster_json(gismo, cluster)[source]

Converts a feature cluster into basic JSON.

Parameters:
  • gismo (Gismo) – Gismo instance
  • cluster (Cluster) – Cluster of features
Returns:

dictionary with keys ‘feature’, ‘focus’, and recursive ‘children’

Return type:

dict

gismo.post_processing.post_features_cluster_print(gismo, cluster, post_item=None, depth='')[source]

Print an ASCII view of a feature cluster with metrics (focus, relevance, similarity)

Parameters:
  • gismo (Gismo) – Gismo instance
  • cluster (Cluster) – Cluster of features
  • post_item (function, optional) – Post-processing function for individual features
  • depth (str, optional) – Current depth string used in recursion
gismo.post_processing.post_features_item_raw(gismo, i)[source]

Feature indice to feature name

Parameters:
  • gismo (Gismo) – Gismo instance
  • i (int) – feature indice
Returns:

Feature i from embedding

Return type:

str

gismo.post_processing.post_landmarks_cluster_json(landmark, cluster)[source]

Default post processor for a cluster of landmarks.

Parameters:
  • landmark (Landmarks) – A Landmarks instance
  • cluster (Cluster) – Cluster of the landmarks to process.
Returns:

A dict with the head landmark, cluster focus, and list of children.

Return type:

dict

gismo.post_processing.post_landmarks_cluster_print(landmark, cluster, post_item=None, depth='')[source]

ASCII display post processor for a cluster of landmarks.

Parameters:
  • landmark (Landmarks) – A Landmarks instance
  • cluster (Cluster) – Cluster of the landmarks to process.
  • post_item (function, optional) – Post-processing function for individual landmarks
  • depth (str, optional) – Current depth string used in recursion
gismo.post_processing.post_landmarks_item_raw(landmark, i)[source]

Default post processor for individual landmarks.

Parameters:
  • landmark (Landmarks) – A Landmarks instance
  • i (int) – Indice of the landmark to process.
Returns:

The landmark of indice i.

Return type:

object

FileSource

class gismo.filesource.FileSource(filename='mysource', path='.', load_source=False)[source]

Yields a file source as a list. Assumes the existence of two files: the mysource.data file contains the stacked items, each compressed with zlib; the mysource.index file contains the list of pointers used to seek items in the data file.

The resulting source object is fully compatible with the Corpus class:
  • It can be iterated ([item for item in source]);
  • It can yield single items (source[i]);
  • It has a length (len(source)).

More advanced functionalities like slices are not implemented.

Parameters:
  • path (str) – Location of the files
  • filename (str) – Stem of the file
  • load_source (bool) – Should the data be loaded in RAM

Examples

>>> import tempfile
>>> with tempfile.TemporaryDirectory() as dirname:
...    create_file_source(filename='mysource', path=dirname)
...    source = FileSource(filename='mysource', path=dirname, load_source=True)
...    content = [e['content'] for e in source]
>>> content[:3]
['Gizmo is a Mogwaï.', 'This is a sentence about Blade.', 'This is another sentence about Shadoks.']

Note: when source is read from file (load_source=False, default behavior), you need to close the source afterwards to avoid pending file handles.

>>> with tempfile.TemporaryDirectory() as dirname:
...    create_file_source(filename='mysource', path=dirname)
...    source = FileSource(filename='mysource', path=dirname)
...    size = len(source)
...    item = source[0]
...    source.close()
>>> size
5
>>> item
{'title': 'First Document', 'content': 'Gizmo is a Mogwaï.'}
gismo.filesource.create_file_source(source=None, filename='mysource', path='.')[source]

Write a source (list of dict) to files in the same format used by FileSource. Only useful to transfer from a computer with a lot of RAM to a computer with less RAM. For more complex cases, e.g. when the initial source itself is a very large file, a dedicated converter has to be provided.

Parameters:
  • source (list of dict) – The source to write
  • filename (str) – Stem of the file. Two files will be created, with suffixes .index and .data.
  • path (str or Path) – Destination directory

Sentencizer

class gismo.sentencizer.Sentencizer(gismo)[source]

The Sentencizer class allows one to refine a document-level gismo into a sentence-level gismo. A simple sentence extraction method is provided. For more complex usages, the class can provide a full Gismo instance that operates at sentence level.

Parameters:gismo (Gismo) – Document-level Gismo.

Examples

We use the C50 Reuters dataset (5000 news paragraphs).

>>> from gismo.datasets.reuters import get_reuters_news
>>> corpus = Corpus(get_reuters_news(), to_text=lambda e: e['content'])
>>> embedding = Embedding()
>>> embedding.fit_transform(corpus)
>>> gismo = Gismo(corpus, embedding)
>>> sentencer = Sentencizer(gismo)

First example: explicitly run the query Orange at document level, then extract 4 covering sentences with a narrow BFS.

>>> success = gismo.rank("Orange")
>>> sentencer.get_sentences(s=4, wide=False) # doctest: +NORMALIZE_WHITESPACE
['Snook says the all important average retained revenue per Orange subscriber will
rise from around 442 pounds per year, partly because dominant telecoms player British
Telecommunications last month raised the price of a call to Orange phones from its fixed lines.',
'Analysts said that Orange shares had good upside potential after a rollercoaster ride in their
short time on the market.',
'Orange, which was floated last March at 205 pence per share, initially saw its stock slump to
157.5 pence before recovering over the last few months to trade at 218 on Tuesday, a rise of four
pence on the day.',
'One-2-One and Orange ORA.L, which offer only digital services, are due to release their
connection figures next week.']

Second example: extract Ericsson-related sentences

>>> sentencer.get_sentences(query="Ericsson") # doctest: +NORMALIZE_WHITESPACE
['These latest wins follow a recent $350 million contract win with Telefon AB L.M. Ericsson,
bolstering its already strong activity in the contract manufacturing of telecommuncation and
data communciation products, he said.',
'The restraints are few in areas such as consumer products, while in sectors such as banking,
distribution and insurance, foreign firms are kept on a very tight leash.',
"The company also said it had told analysts in a briefing Tuesday of new contract wins with
Ascend Communications Inc, Harris Corp's Communications unit and Philips Electronics NV.",
'Pocket is the first from the high-priced 1996 auction known to have filed for bankruptcy
protection.',
'With Ascend in particular, he said the company would be manufacturing the company\'s
mainstream MAX TNT remote access network equipment. "']

Third example: extract Communications-related sentences from a string.

>>> txt = gismo.corpus[4517]['content']
>>> sentencer.get_sentences(query="communications", txt=txt) # doctest: +NORMALIZE_WHITESPACE
["Privately-held Pocket's big creditors include a group of Asian entrepreneurs and
communications-equipment makers Siemens AG of Germany and L.M. Ericsson of Sweden.",
"2 bidder at the government's high-flying wireless phone auction last year has filed for
bankruptcy protection from its creditors, underscoring the problems besetting the
auction's winners.",
"The Federal Communications Commission on Monday gave PCS companies from last year's
auction some breathing space when it suspended indefinitely a March 31 deadline for
them to make payments to the agency for their licenses."]
get_sentences(query=None, txt=None, k=None, s=None, resolution=0.7, stretch=2.0, wide=True, post=True)[source]

All-in-one method to extract covering sentences from the corpus. Computes sentence-level corpus, sentence-level gismo, and calls get_documents_by_coverage().

Parameters:
  • query (str (optional)) – Query to run on the document-level Gismo
  • txt (str (optional)) – Text to use for sentence extraction. If not set, the sentences will be extracted from the top-documents.
  • k (int (optional)) – Number of top documents used for the build. If not set, the auto_k() heuristic of the document-level Gismo will be used.
  • s (int (optional)) – Number of sentences to return. If not set, the auto_k() heuristic of the sentence-level Gismo will be used.
  • resolution (float (optional)) – Tree resolution passed to the get_documents_by_coverage() method.
  • stretch (float >= 1 (optional)) – Stretch factor passed to the get_documents_by_coverage() method.
  • wide (bool (optional)) – bfs wideness passed to the get_documents_by_coverage() method.
  • post (bool (optional)) – Use of post-processing passed to the get_documents_by_coverage() method.
Returns:

Return type:

list

make_sent_gismo(query=None, txt=None, k=None, **kwargs)[source]

Construct a sentence-level Gismo stored in the sent_gismo attribute.

Parameters:
  • query (str (optional)) – Query to run on the document-level Gismo.
  • txt (str (optional)) – Text to use for sentence extraction. If not set, the sentences will be extracted from the top-documents.
  • k (int (optional)) – Number of top documents used for the build. If not set, the auto_k() heuristic will be used.
  • kwargs (dict) – Custom default runtime parameters to pass to the sentence-level Gismo. You just need to specify the parameters that differ from DEFAULT_PARAMETERS. Note that distortion will be automatically de-activated. If you really want it, manually change the value of self.sent_gismo.parameters.distortion afterwards.
Returns:

Return type:

Sentencizer
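A usage sketch (sentencer as built in the class example; the sentence-level Gismo then lives in the sent_gismo attribute):

>>> _ = sentencer.make_sent_gismo(query="Orange")  # doctest: +SKIP
>>> success = sentencer.sent_gismo.rank("Orange")  # doctest: +SKIP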

splitter(txt)[source]

Transform input content into a corpus of sentences stored into the sent_corpus attribute.

Parameters:txt (str or list) – Text or list of documents to split into sentences. For the latter, documents are assumed to be provided as (content, id) pairs, where content is the actual text and id a reference of the document.
Returns:
Return type:Sentencizer
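A sketch of direct use (sentencer as above; splitter returns the instance itself, so the call is chained or discarded):

>>> _ = sentencer.splitter("This is a sentence. This is another one.")  # doctest: +SKIP
>>> len(sentencer.sent_corpus)  # doctest: +SKIP
2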

Datasets

gismo.datasets.acm.flatten_acm(acm, min_size=5, max_depth=100, exclude=None, depth=0)[source]

Select subdomains of an acm tree and return them as a list.

Parameters:
  • acm (list of dicts) – acm tree from get_acm.
  • min_size (int) – size threshold to select a domain (avoids small domains)
  • max_depth (int) – depth threshold to select a domain (avoids deep domains)
  • exclude (list) – list of domains to exclude from the results
Returns:

A flat list of domains described by name and query.

Return type:

list

Example

>>> acm = flatten_acm(get_acm())
>>> acm[111]['name']
'Graph theory'
gismo.datasets.acm.get_acm(refresh=False)[source]
Parameters:refresh (bool) – If True, builds a new forest from the Internet, otherwise use a static version.
Returns:acm – Each dict is an ACM domain. It contains category name, query (concatenation of names from domain and subdomains), size (number of subdomains including itself), and children (list of domain dicts).
Return type:list of dicts

Examples

>>> acm = get_acm()
>>> subdomain = acm[4]['children'][2]['children'][1]
>>> subdomain['name']
'Software development process management'
>>> subdomain['size']
10
>>> subdomain['query']
'Software development process management, Software development methods, Rapid application development, Agile software development, Capability Maturity Model, Waterfall model, Spiral model, V-model, Design patterns, Risk management'
>>> acm = get_acm(refresh=True)
>>> len(acm)
13
gismo.datasets.dblp.DEFAULT_FIELDS = {'authors', 'title', 'type', 'venue', 'year'}

Default fields to extract.

gismo.datasets.dblp.DTD_URL = 'https://dblp.uni-trier.de/xml/dblp.dtd'

URL of the dtd file (required to correctly parse non-ASCII characters).

class gismo.datasets.dblp.Dblp(dblp_url='https://dblp.uni-trier.de/xml/dblp.xml.gz', filename='dblp', path='.')[source]

The Dblp class can download the DBLP database and produce source files compatible with the FileSource class.

Parameters:
  • dblp_url (str, optional) – Alternative URL for the dblp.xml.gz file
  • filename (str) – Stem of the files (suffixes will be appended)
  • path (str or path, optional) – Destination of the files
build(refresh=False, d=2, fields=None)[source]

Main class method. Create the data and index files.

Parameters:
  • refresh (bool) – Tell if files are to be rebuilt if they are already there.
  • d (int) – depth level where articles are. Usually 2 or 3 (2 for the main database).
  • fields (set, optional) – Set of fields to collect. Default to DEFAULT_FIELDS.

Example

By default, the class downloads the full dataset. Here we limit ourselves to a single author page.

>>> toy_url = "https://dblp.org/pers/xx/m/Mathieu:Fabien.xml"
>>> import tempfile
>>> from gismo.filesource import FileSource
>>> tmp = tempfile.TemporaryDirectory()
>>> dblp = Dblp(dblp_url=toy_url, path=tmp.name)
>>> dblp.build() # doctest: +ELLIPSIS
Retrieve https://dblp.org/pers/xx/m/Mathieu:Fabien.xml from the Internet.
DBLP database downloaded to ...xml.gz.
Converting DBLP database from ...xml.gz (may take a while).
Building Index.
Conversion done.

By default, build uses existing files.

>>> dblp.build() # doctest: +ELLIPSIS
File ...xml.gz already exists. Use refresh option to overwrite.
File ...data already exists. Use refresh option to overwrite.

The refresh parameter can be used to ignore existing files.

>>> dblp.build(d=3, refresh=True) # doctest: +ELLIPSIS
Retrieve https://dblp.org/pers/xx/m/Mathieu:Fabien.xml from the Internet.
DBLP database downloaded to ...xml.gz.
Converting DBLP database from ...xml.gz (may take a while).
Building Index.
Conversion done.

The resulting files can be used to create a FileSource.

>>> source = FileSource(filename="dblp", path=tmp.name)
>>> art = [s for s in source if s['title']=="Can P2P networks be super-scalable?"][0]
>>> art['authors'] # doctest: +ELLIPSIS
['François Baccelli', 'Fabien Mathieu', 'Ilkka Norros', 'Rémi Varloot']

Don’t forget to close source after use.

>>> source.close()
>>> tmp.cleanup()
gismo.datasets.dblp.LIST_TYPE_FIELDS = {'authors', 'urls'}

DBLP fields with possibly multiple entries.

gismo.datasets.dblp.URL = 'https://dblp.uni-trier.de/xml/dblp.xml.gz'

URL of the full DBLP database.

gismo.datasets.dblp.element_to_filesource(elt, data_handler, index, fields)[source]
  • Converts the xml element elt into a dict if it is an article;
  • Compresses and writes the dict to data_handler;
  • Appends the file position in data_handler to index.
Parameters:
  • elt (Any) – a XML element.
  • data_handler (file_descriptor) – Where the compressed data will be stored. Must be writable.
  • index (list) – A list that contains the initial position in data_handler of all previously processed elements.
  • fields (set) – Set of fields to retrieve.
Returns:

Always return True for compatibility with the xml parser.

Return type:

bool

gismo.datasets.dblp.element_to_source(elt, source, fields)[source]

Test if elt is an article, converts it to dictionary and appends to source

Parameters:
  • elt (Any) – a XML element.
  • source (list) – the source in construction.
  • fields (set) – Set of fields to retrieve.
gismo.datasets.dblp.fast_iter(context, func, d=2, **kwargs)[source]

Applies func to all xml elements of depth d of the xml parser context. **kwargs are passed to func.

Modified version of a modified version of Liza Daly’s fast_iter, inspired by https://stackoverflow.com/questions/4695826/efficient-way-to-iterate-through-xml-elements

Parameters:
  • context (XMLparser) – A parser obtained from etree.iterparse
  • func (function) – How to process the elements
  • d (int, optional) – Depth to process elements.
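A sketch of the expected call pattern (hypothetical file name; lxml's etree.iterparse assumed, as used for XML parsing here):

>>> from lxml import etree  # doctest: +SKIP
>>> context = etree.iterparse('dblp.xml', load_dtd=True)  # doctest: +SKIP
>>> fast_iter(context, element_to_source, d=2, source=[], fields=DEFAULT_FIELDS)  # doctest: +SKIP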
gismo.datasets.dblp.url2source(url, fields=None)[source]

Directly transforms the URL of a DBLP XML file into a list of dictionaries. Only use it for datasets that fit into memory (e.g. the articles of one author). If the dataset does not fit, consider using the Dblp class instead.

Parameters:
  • url (str) – the URL to fetch.
  • fields (set) – Set of DBLP fields to capture.
Returns:

source – Articles retrieved from the URL

Return type:

list of dict

Example

>>> source = url2source("https://dblp.org/pers/xx/t/Tixeuil:S=eacute=bastien.xml", fields={'authors', 'title', 'year', 'venue', 'urls'})
>>> art = [s for s in source if s['title']=="Distributed Computing with Mobile Robots: An Introductory Survey."][0]
>>> art['authors']
['Maria Potop-Butucaru', 'Michel Raynal', 'Sébastien Tixeuil']
>>> art['urls']
['https://doi.org/10.1109/NBiS.2011.55', 'http://doi.ieeecomputersociety.org/10.1109/NBiS.2011.55']
gismo.datasets.dblp.xml_element_to_dict(elt, fields)[source]

Converts the XML element elt into a dict if it is a paper.

Parameters:
  • elt (Any) – an XML element.
  • fields (set) – Set of entries to retrieve.
Returns:

Article dictionary if element contains the attributes of an article, None otherwise.

Return type:

dict or None
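
An illustrative sketch of a direct call; the XML layout below is an assumption for the example (real inputs come from DBLP dumps), so the exact return value is not guaranteed:

>>> from lxml import etree  # assumption: a compatible Element implementation
>>> elt = etree.fromstring(
...     "<article><author>Ada Lovelace</author>"
...     "<title>Sketch of the Analytical Engine.</title>"
...     "<year>1843</year></article>")
>>> d = xml_element_to_dict(elt, fields={'authors', 'title', 'year'})
>>> # d is an article dict if elt is recognized as a paper, None otherwise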

gismo.datasets.reuters.get_reuters_entry(name, z)[source]

Reads the Reuters news item referenced by name in the zip archive z and returns it as a dict.

Parameters:
  • name (str) – Location of the file inside the Reuters archive
  • z (ZipFile) – Zipfile descriptor of the Reuters archive
Returns:

entry – Dict with keys set (C50test or C50train), author, id, and content

Return type:

dict
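
A sketch of reading a single entry by hand; the manual download and the .txt filter are assumptions of this example, not part of the API:

>>> from urllib.request import urlretrieve
>>> from zipfile import ZipFile
>>> fname, _ = urlretrieve("https://github.com/balouf/datasets/raw/main/C50.zip")
>>> with ZipFile(fname) as z:
...     name = next(n for n in z.namelist() if n.endswith(".txt"))
...     entry = get_reuters_entry(name, z)
>>> sorted(entry)  # the keys documented above
['author', 'content', 'id', 'set']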

gismo.datasets.reuters.get_reuters_news(url='https://github.com/balouf/datasets/raw/main/C50.zip')[source]

Returns a list of news items from the Reuters C50 news dataset.

Acknowledgments

Dua, D. and Graff, C. (2019). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.

ZhiLiu, e-mail: liuzhi8673 ‘@’ gmail.com, institution: National Engineering Research Center for E-Learning, Hubei Wuhan, China

Parameters:
  • url (str) – Location of the C50 dataset
Returns:

The C50 news as a list of dict

Return type:

list

Example

Cf. Sentencizer.
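
In addition, a minimal sketch, assuming the goal is to feed the news into a Corpus (the content key is documented in get_reuters_entry() above):

>>> from gismo.corpus import Corpus
>>> from gismo.datasets.reuters import get_reuters_news
>>> news = get_reuters_news()  # downloads the archive from the default URL
>>> corpus = Corpus(news, to_text=lambda entry: entry['content'])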

Common

class gismo.common.MixInIO[source]

Provides basic save/load capabilities to other classes.

dump(filename: str, path='.', overwrite=False, compress=True)[source]

Save instance to file.

Parameters:
  • filename (str) – The stem of the filename.
  • path (str or Path, optional) – The location path.
  • overwrite (bool) – Should an existing file be overwritten?
  • compress (bool) – Should gzip compression be used?

Examples

>>> import tempfile
>>> v1 = ToyClass(42)
>>> v2 = ToyClass()
>>> v2.value
0
>>> with tempfile.TemporaryDirectory() as tmpdirname:
...     v1.dump(filename='myfile', compress=True, path=tmpdirname)
...     dir_content = [file.name for file in Path(tmpdirname).glob('*')]
...     v2 = ToyClass.load(filename='myfile', path=Path(tmpdirname))
...     v1.dump(filename='myfile', compress=True, path=tmpdirname) # doctest: +ELLIPSIS
File ...myfile.pkl.gz already exists! Use overwrite option to overwrite.
>>> dir_content
['myfile.pkl.gz']
>>> v2.value
42
>>> with tempfile.TemporaryDirectory() as tmpdirname:
...     v1.dump(filename='myfile', compress=False, path=tmpdirname)
...     v1.dump(filename='myfile', compress=False, path=tmpdirname) # doctest: +ELLIPSIS
File ...myfile.pkl already exists! Use overwrite option to overwrite.
>>> v1.value = 51
>>> with tempfile.TemporaryDirectory() as tmpdirname:
...     v1.dump(filename='myfile', path=tmpdirname, compress=False)
...     v1.dump(filename='myfile', path=tmpdirname, overwrite=True, compress=False)
...     v2 = ToyClass.load(filename='myfile', path=tmpdirname)
...     dir_content = [file.name for file in Path(tmpdirname).glob('*')]
>>> dir_content
['myfile.pkl']
>>> v2.value
51
>>> with tempfile.TemporaryDirectory() as tmpdirname:
...    v2 = ToyClass.load(filename='thisfilenamedoesnotexist') # doctest: +ELLIPSIS
Traceback (most recent call last):
 ...
FileNotFoundError: [Errno 2] No such file or directory: ...
classmethod load(filename: str, path='.')[source]

Load instance from file.

Parameters:
  • filename (str) – The stem of the filename.
  • path (str or Path, optional) – The location path.
class gismo.common.ToyClass(value=0)[source]

Minimal class with a single value attribute (default 0); used in the documentation examples of MixInIO.
gismo.common.auto_k(data, order=None, max_k=100, target=1.0)[source]

Proposes a number k of significant values in a relevance vector.

Parameters:
  • data (ndarray) – Vector with positive relevance values.
  • order (list of int, optional) – Ordered indices of data
  • max_k (int) – Maximal number of entries to return; also the number of top entries used to determine the threshold.
  • target (float) – Threshold modulation. A higher target means fewer results. A target of 1.0 corresponds to using the average of the max_k top values as the threshold.
Returns:

k – Recommended number of values.

Return type:

int

Example

>>> data = np.array([30, 1, 2, .3, 4, 50, 80])
>>> auto_k(data)
3
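
To see where the 3 comes from, here is the documented rule for target=1.0 spelled out (an illustration of the rule, not necessarily the exact implementation): the threshold is the average of the (at most) max_k top values, and k counts the entries above it.

>>> threshold = np.mean(np.sort(data)[::-1][:100])  # all 7 values here, since max_k=100
>>> float(threshold)  # doctest: +ELLIPSIS
23.9...
>>> int(np.sum(data > threshold))
3
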
gismo.common.toy_source_dict = [{'title': 'First Document', 'content': 'Gizmo is a Mogwaï.'}, {'title': 'Second Document', 'content': 'This is a sentence about Blade.'}, {'title': 'Third Document', 'content': 'This is another sentence about Shadoks.'}, {'title': 'Fourth Document', 'content': 'This very long sentence, with a lot of stuff about Star Wars inside, makes at some point a side reference to the Gremlins movie by comparing Gizmo and Yoda.'}, {'title': 'Fifth Document', 'content': 'In chinese folklore, a Mogwaï is a demon.'}]

A minimal source example where items are dict with keys title and content.

gismo.common.toy_source_text = ['Gizmo is a Mogwaï.', 'This is a sentence about Blade.', 'This is another sentence about Shadoks.', 'This very long sentence, with a lot of stuff about Star Wars inside, makes at some point a side reference to the Gremlins movie by comparing Gizmo and Yoda.', 'In chinese folklore, a Mogwaï is a demon.']

A minimal source example where items are str.

Parameters

gismo.parameters.ALPHA = 0.5

Default value for damping factor. Controls the trade-off between closeness and centrality.

gismo.parameters.BALANCE = 0.5

Default documents/features trade-off in Landmarks.

gismo.parameters.DEFAULT_LANDMARKS_PARAMETERS = {'balance': 0.5, 'distortion': 1.0, 'max_k': 100, 'post': True, 'rank': <function <lambda>>, 'resolution': 0.7, 'stretch': 2.0, 'target_k': 1.0, 'wide': True, 'x_density': 1000, 'y_density': 1000}

Dictionary of default runtime Landmarks parameters.

gismo.parameters.DEFAULT_PARAMETERS = {'alpha': 0.5, 'distortion': 1.0, 'max_k': 100, 'memory': 0.0, 'n_iter': 4, 'offset': 1.0, 'post': True, 'resolution': 0.7, 'stretch': 2.0, 'target_k': 1.0, 'wide': True}

Dictionary of default runtime Gismo parameters.

gismo.parameters.DISTORTION = 1.0

Default distortion. Controls how much of diteration relevance is mixed into the embedding for similarity computation.

gismo.parameters.MAX_K = 100

Default top population size for estimating k.

gismo.parameters.MEMORY = 0.0

Default memory value. Controls how much of previous computation is kept when performing a new diffusion.

gismo.parameters.N_ITER = 4

Default value for the number of round-trip diffusions to perform. Higher value means better precision but longer execution time.

gismo.parameters.OFFSET = 1.0

Default offset value. Controls how much of the initial fluid should be deducted from the relevance.

gismo.parameters.POST = True

Default post policy. If True, the post function is applied to items and clusters.

class gismo.parameters.Parameters(parameter_list=None, **kwargs)[source]

Manages Gismo runtime parameters. When called, an instance yields a dictionary of parameters. It is also used by other Gismo classes such as Landmarks.

Parameters:
  • parameter_list (dict, optional) – Indicates which parameters to manage. Defaults to the Gismo runtime parameters.
  • kwargs (dict) – Parameters whose values should differ from the defaults.

Examples

Use default parameters.

>>> p = Parameters()
>>> p() # doctest: +NORMALIZE_WHITESPACE
{'alpha': 0.5, 'n_iter': 4, 'offset': 1.0, 'memory': 0.0,
'stretch': 2.0, 'resolution': 0.7, 'max_k': 100, 'target_k': 1.0,
'wide': True, 'post': True, 'distortion': 1.0}

Use default parameters with changed stretch.

>>> p = Parameters(stretch=1.7)
>>> p()['stretch']
1.7

Note that parameters that do not exist will be ignored (and a warning will be issued).

>>> p = Parameters(strech=1.7)
>>> p() # doctest: +NORMALIZE_WHITESPACE
{'alpha': 0.5, 'n_iter': 4, 'offset': 1.0, 'memory': 0.0,
'stretch': 2.0, 'resolution': 0.7, 'max_k': 100, 'target_k': 1.0,
'wide': True, 'post': True, 'distortion': 1.0}

You can change the value of an attribute to alter the returned parameter.

>>> p.alpha = 0.85
>>> p()['alpha']
0.85

You can also apply on-the-fly parameters by passing them when calling the instance.

>>> p(resolution=0.9)['resolution']
0.9

As at construction, parameters that do not exist are ignored and a warning is issued.

>>> p(resolutio = .9) # doctest: +NORMALIZE_WHITESPACE
{'alpha': 0.85, 'n_iter': 4, 'offset': 1.0, 'memory': 0.0,
'stretch': 2.0, 'resolution': 0.7, 'max_k': 100, 'target_k': 1.0,
'wide': True, 'post': True, 'distortion': 1.0}

Note that a custom set of parameters can be managed by passing parameter_list at construction.

>>> p = Parameters(parameter_list={'a': 1.0, 'b': True}, a=1.5)
>>> p()
{'a': 1.5, 'b': True}
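
For instance, a sketch of managing Landmarks parameters by reusing the module-level default dictionary documented above:

>>> from gismo.parameters import Parameters, DEFAULT_LANDMARKS_PARAMETERS
>>> p = Parameters(parameter_list=DEFAULT_LANDMARKS_PARAMETERS, stretch=1.5)
>>> p()['stretch']
1.5
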
gismo.parameters.RESOLUTION = 0.7

Default resolution value. Defines how strict the merging of clusters is during recursive clustering.

gismo.parameters.STRETCH = 2.0

Default stretch value. When performing covering, defines the ratio between considered pages and selected covering pages.

gismo.parameters.TARGET_K = 1.0

Default threshold for estimating k.

gismo.parameters.WIDE = True

Default behavior for covering. True for the wide variant, False for the core variant.

gismo.parameters.X_DENSITY = 1000

Default number of documents representing a Landmarks entry.

gismo.parameters.Y_DENSITY = 1000

Default number of features representing a Landmarks entry.