Landmarks Tutorial

In this notebook, we will use the landmarks submodule of Gismo to give an interactive description of ACM topics and researchers of the https://www.lincs.fr laboratory.

This notebook can be used as a blueprint to analyze other groups of people under the scope of a topic classification.

Before starting this tutorial, it is recommended to have looked at the ACM and DBLP tutorials.

Lincs Researchers

In this section, we bind the LINCS researchers with their DBLP id.

List of DBLP researchers

First, we open the DBLP database (see the DBLP Tutorial to get your copy of the database).

[1]:
from pathlib import Path

path = Path("../../../../../../Datasets/DBLP")

from gismo.filesource import FileSource
source = FileSource(filename="dblp", path=path)

source is a list-like object whose entries are dicts that describe articles.

[2]:
source[1234567]
[2]:
{'type': 'article',
 'authors': ['Andrzej M. Borzyszkowski', 'Philippe Darondeau'],
 'title': 'Transition systems without transitions.',
 'year': '2005',
 'venue': 'Theor. Comput. Sci.'}

Let’s extract the set of authors. In each author name, spaces are replaced with underscores so that an author is handled as a single token later on.

[3]:
dblp_authors = {a.replace(" ", "_") for paper in source for a in paper['authors']}
[4]:
"Fabien_Mathieu" in dblp_authors
[4]:
True
[5]:
"Fabin_Mathieu" in dblp_authors
[5]:
False

List of Lincs Members

First, we fetch the LINCS webpage that lists its researchers and feed it to BeautifulSoup.

[6]:
import requests
from bs4 import BeautifulSoup as bs
soup = bs(requests.get('https://www.lincs.fr/people/').text)

We write a function to convert the table rows of the HTML page into researcher dicts. Each dict will have a name (display name) and a dblp (DBLP id) entry.

[7]:
from bof.fuzz import Process

p = Process()
p.fit(list(dblp_authors))
[8]:
def row2dict(row, dblp_authors, manual=None):
    """
    Soup-to-dict conversion.

    Parameters
    ----------
    row: soup
        The row to convert
    dblp_authors: set
        The set of DBLP author ids
    manual: dict
        Manual associations between name and id

    Returns
    -------
    dict
        A dict shaped like {'name': "John Doe", 'dblp': "john_doe"}
    """
    if manual is None:
        manual = {}
    # name extraction
    row = row('td')[1]
    name = row.text
    # manual association
    if name in manual:
        return {'name': name, 'dblp': manual[name]}
    # Attempt direct transformation
    dblp = name.replace(" ", "_")
    # If result exists in dblp, return that
    if dblp in dblp_authors:
        return {'name': name, 'dblp': dblp}
    # Attempt to use the lincs automatic URL to infer dblp name
    a = row.a
    if a:
        href = a['href'].split('?more=')[1]
        if href in dblp_authors:
            return {'name': name, 'dblp': href}
    # Last chance: use bof's fuzzy matching to guess the right answer.
    print(f"No direct dblp entry found for {name}, start fuzzy search")
    candidate = p.extractOne(name.lower().replace(" ", "_"))
    if candidate:
        print(f"Found candidate: {candidate}")
        dblp = candidate[0]
        return {'name': name, 'dblp': dblp}
    # If all failed, return empty id
    return {'name': name, 'dblp': ""}

The manual override below was actually populated by executing the next cells and iterating until every researcher was correctly matched.

[9]:
manual = {"Giovanni Pau": "Giovanni_Pau_0001"}

The actual construction of the list of LINCS researchers.

[10]:
lincs = [row2dict(line, dblp_authors, manual) for table in soup('table')[:2] for line in table('tr')]
No direct dblp entry found for Ana Bušić, start fuzzy search
Found candidate: ('Ana_Busic', 41.666666666666664)
No direct dblp entry found for Chung Shue (Calvin) Chen, start fuzzy search
Found candidate: ('Chung_Shue_Chen', 48.07692307692308)
No direct dblp entry found for Joaquin Garcia-Alfaro, start fuzzy search
Found candidate: ('Joaquin_Garcia', 65.51724137931035)
No direct dblp entry found for Remi Varloot, start fuzzy search
Found candidate: ('Rémi_Varloot', 68.42105263157895)
[11]:
lincs
[11]:
[{'name': 'Alonso Silva', 'dblp': 'Alonso_Silva'},
 {'name': 'Ana Bušić', 'dblp': 'Ana_Busic'},
 {'name': 'Anaïs Vergne', 'dblp': 'Anaïs_Vergne'},
 {'name': 'Bartek Blaszczyszyn', 'dblp': 'Bartek_Blaszczyszyn'},
 {'name': 'Chung Shue (Calvin) Chen', 'dblp': 'Chung_Shue_Chen'},
 {'name': 'Daniel Kofman', 'dblp': 'Daniel_Kofman'},
 {'name': 'Eitan Altman', 'dblp': 'Eitan_Altman'},
 {'name': 'François Baccelli', 'dblp': 'François_Baccelli'},
 {'name': 'Laurent Decreusefond', 'dblp': 'Laurent_Decreusefond'},
 {'name': 'Ludovic Noirie', 'dblp': 'Ludovic_Noirie'},
 {'name': 'Makhlouf Hadji', 'dblp': 'Makhlouf_Hadji'},
 {'name': 'Marc Lelarge', 'dblp': 'Marc_Lelarge'},
 {'name': 'Marceau Coupechoux', 'dblp': 'Marceau_Coupechoux'},
 {'name': 'Maria Potop-Butucaru', 'dblp': 'Maria_Potop-Butucaru'},
 {'name': 'Petr Kuznetsov', 'dblp': 'Petr_Kuznetsov'},
 {'name': 'Philippe Jacquet', 'dblp': 'Philippe_Jacquet'},
 {'name': 'Philippe Martins', 'dblp': 'Philippe_Martins'},
 {'name': 'Renata Teixeira', 'dblp': 'Renata_Teixeira'},
 {'name': 'Serge Fdida', 'dblp': 'Serge_Fdida'},
 {'name': 'Sébastien Tixeuil', 'dblp': 'Sébastien_Tixeuil'},
 {'name': 'Thomas Bonald', 'dblp': 'Thomas_Bonald'},
 {'name': 'Timur Friedman', 'dblp': 'Timur_Friedman'},
 {'name': 'Andrea Araldo', 'dblp': 'Andrea_Araldo'},
 {'name': 'Andrea Marcano', 'dblp': 'Andrea_Marcano'},
 {'name': 'Dimitrios Milioris', 'dblp': 'Dimitrios_Milioris'},
 {'name': 'Elie de Panafieu', 'dblp': 'Elie_de_Panafieu'},
 {'name': 'Fabio Pianese', 'dblp': 'Fabio_Pianese'},
 {'name': 'Francesca Bassi', 'dblp': 'Francesca_Bassi'},
 {'name': 'François Durand', 'dblp': 'François_Durand'},
 {'name': 'Joaquin Garcia-Alfaro', 'dblp': 'Joaquin_Garcia'},
 {'name': 'Leonardo Linguaglossa', 'dblp': 'Leonardo_Linguaglossa'},
 {'name': 'Lorenzo Maggi', 'dblp': 'Lorenzo_Maggi'},
 {'name': 'Luis Uzeda Garcia', 'dblp': 'Luis_Uzeda_Garcia'},
 {'name': 'Marc-Olivier Buob', 'dblp': 'Marc-Olivier_Buob'},
 {'name': 'Mauro Sozio', 'dblp': 'Mauro_Sozio'},
 {'name': 'Natalya Rozhnova', 'dblp': 'Natalya_Rozhnova'},
 {'name': 'Remi Varloot', 'dblp': 'Rémi_Varloot'},
 {'name': 'Sara Ayoubi', 'dblp': 'Sara_Ayoubi'},
 {'name': 'Stefano Paris', 'dblp': 'Stefano_Paris'},
 {'name': 'Tianzhu Zhang', 'dblp': 'Tianzhu_Zhang'},
 {'name': 'Tijani Chahed', 'dblp': 'Tijani_Chahed'}]
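
As a quick check (a sketch using only objects defined above), we can list the members whose DBLP id is empty or not in the author set; an empty list means everyone is matched, possibly through the manual override.

# Sketch: members whose dblp entry is missing from the DBLP author set.
unmatched = [r['name'] for r in lincs if r['dblp'] not in dblp_authors]
print(unmatched)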

DBLP Gismo

In this section, we use Landmarks to construct a small XGismo focused on the LINCS researchers. In detail:

- We construct a large Gismo between articles and researchers, exactly as in the DBLP tutorial;
- We use landmarks to extract a (much smaller) list of articles based on collaboration proximity;
- We build an XGismo between researchers and keywords from this smaller source.

Construction of a full Gismo on authors

This part is similar to the one from the DBLP tutorial.

[12]:
from gismo.corpus import Corpus
from gismo.embedding import Embedding
from gismo.gismo import Gismo
from sklearn.feature_extraction.text import CountVectorizer
vectorizer_author = CountVectorizer(dtype=float, preprocessor=lambda x:x, tokenizer=lambda x: x.split(' '))
[13]:
def to_authors_text(dic):
    return " ".join([a.replace(' ', '_') for a in dic['authors']])
corpus = Corpus(source, to_text=to_authors_text)
[14]:
try:
    embedding = Embedding.load(filename="dblp_aut_embedding", path=path)
except:
    embedding = Embedding(vectorizer=vectorizer_author)
    embedding.fit_transform(corpus)
    embedding.dump(filename="dblp_aut_embedding", path=path)
[15]:
gismo = Gismo(corpus, embedding)

Given the size of the dataset, processing a query can take about one second.
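
If you want to check this claim on your machine, here is a minimal timing sketch (plain time module, nothing Gismo-specific):

# Sketch: time a single ranking diffusion on the full Gismo.
import time
start = time.time()
gismo.rank("Fabien_Mathieu")
print(f"Ranking took {time.time() - start:.2f} seconds.")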

[16]:
gismo.rank("Fabien_Mathieu")
[16]:
True
[17]:
from gismo.post_processing import post_features_cluster_print
gismo.post_features_cluster = post_features_cluster_print
gismo.get_features_by_cluster()
 F: 0.02. R: 0.25. S: 0.74.
- F: 0.03. R: 0.25. S: 0.72.
-- F: 0.04. R: 0.24. S: 0.72.
--- F: 0.06. R: 0.21. S: 0.67.
---- F: 0.06. R: 0.19. S: 0.61.
----- F: 0.25. R: 0.18. S: 0.55.
------ F: 0.30. R: 0.18. S: 0.58.
------- F: 0.42. R: 0.15. S: 0.82.
-------- Fabien_Mathieu (R: 0.11; S: 1.00)
-------- F: 0.61. R: 0.04. S: 0.43.
--------- Laurent_Viennot (R: 0.02; S: 0.47)
--------- F: 0.70. R: 0.02. S: 0.36.
---------- Diego_Perino (R: 0.01; S: 0.42)
---------- Yacine_Boufkhad (R: 0.01; S: 0.30)
------- F: 0.77. R: 0.03. S: 0.36.
-------- Julien_Reynier (R: 0.01; S: 0.37)
-------- Fabien_de_Montgolfier (R: 0.01; S: 0.38)
-------- Anh-Tuan_Gai (R: 0.01; S: 0.29)
------ Gheorghe_Postelnicu (R: 0.00; S: 0.19)
----- F: 0.33. R: 0.01. S: 0.29.
------ The_Dang_Huynh (R: 0.01; S: 0.28)
------ Dohy_Hong (R: 0.00; S: 0.18)
---- F: 0.93. R: 0.02. S: 0.32.
----- Ludovic_Noirie (R: 0.01; S: 0.34)
----- François_Durand (R: 0.01; S: 0.32)
--- F: 0.40. R: 0.02. S: 0.34.
---- F: 0.58. R: 0.02. S: 0.33.
----- Céline_Comte (R: 0.01; S: 0.32)
----- Thomas_Bonald (R: 0.01; S: 0.24)
---- Anne_Bouillard (R: 0.00; S: 0.21)
--- Nidhi_Hegde_0001 (R: 0.00; S: 0.21)
-- F: 1.00. R: 0.01. S: 0.24.
--- Ilkka_Norros (R: 0.01; S: 0.24)
--- François_Baccelli (R: 0.01; S: 0.24)
- Mohamed_Bouklit (R: 0.01; S: 0.19)

Using landmarks to shrink a source

To reduce the size of the dataset, we make landmarks out of the researchers, and we credit each entry with a budget of 2,000 articles.

[18]:
from gismo.landmarks import Landmarks
lincs_landmarks_full = Landmarks(source=lincs, to_text=lambda x: x['dblp'],
                                 x_density=2000)

We launch the computation of the reduced source. This takes a couple of minutes, as a ranking diffusion needs to be performed for each researcher.

[19]:
import logging
logging.basicConfig()
log = logging.getLogger("Gismo")
log.setLevel(level=logging.DEBUG)
[20]:
reduced_source = lincs_landmarks_full.get_reduced_source(gismo)
INFO:Gismo:Start computation of 41 landmarks.
DEBUG:Gismo:Processing Alonso_Silva.
DEBUG:Gismo:Landmarks of Alonso_Silva computed.
DEBUG:Gismo:Processing Ana_Busic.
DEBUG:Gismo:Landmarks of Ana_Busic computed.
DEBUG:Gismo:Processing Anaïs_Vergne.
DEBUG:Gismo:Landmarks of Anaïs_Vergne computed.
DEBUG:Gismo:Processing Bartek_Blaszczyszyn.
DEBUG:Gismo:Landmarks of Bartek_Blaszczyszyn computed.
DEBUG:Gismo:Processing Chung_Shue_Chen.
DEBUG:Gismo:Landmarks of Chung_Shue_Chen computed.
DEBUG:Gismo:Processing Daniel_Kofman.
DEBUG:Gismo:Landmarks of Daniel_Kofman computed.
DEBUG:Gismo:Processing Eitan_Altman.
DEBUG:Gismo:Landmarks of Eitan_Altman computed.
DEBUG:Gismo:Processing François_Baccelli.
DEBUG:Gismo:Landmarks of François_Baccelli computed.
DEBUG:Gismo:Processing Laurent_Decreusefond.
DEBUG:Gismo:Landmarks of Laurent_Decreusefond computed.
DEBUG:Gismo:Processing Ludovic_Noirie.
DEBUG:Gismo:Landmarks of Ludovic_Noirie computed.
DEBUG:Gismo:Processing Makhlouf_Hadji.
DEBUG:Gismo:Landmarks of Makhlouf_Hadji computed.
DEBUG:Gismo:Processing Marc_Lelarge.
DEBUG:Gismo:Landmarks of Marc_Lelarge computed.
DEBUG:Gismo:Processing Marceau_Coupechoux.
DEBUG:Gismo:Landmarks of Marceau_Coupechoux computed.
DEBUG:Gismo:Processing Maria_Potop-Butucaru.
DEBUG:Gismo:Landmarks of Maria_Potop-Butucaru computed.
DEBUG:Gismo:Processing Petr_Kuznetsov.
DEBUG:Gismo:Landmarks of Petr_Kuznetsov computed.
DEBUG:Gismo:Processing Philippe_Jacquet.
DEBUG:Gismo:Landmarks of Philippe_Jacquet computed.
DEBUG:Gismo:Processing Philippe_Martins.
DEBUG:Gismo:Landmarks of Philippe_Martins computed.
DEBUG:Gismo:Processing Renata_Teixeira.
DEBUG:Gismo:Landmarks of Renata_Teixeira computed.
DEBUG:Gismo:Processing Serge_Fdida.
DEBUG:Gismo:Landmarks of Serge_Fdida computed.
DEBUG:Gismo:Processing Sébastien_Tixeuil.
DEBUG:Gismo:Landmarks of Sébastien_Tixeuil computed.
DEBUG:Gismo:Processing Thomas_Bonald.
DEBUG:Gismo:Landmarks of Thomas_Bonald computed.
DEBUG:Gismo:Processing Timur_Friedman.
DEBUG:Gismo:Landmarks of Timur_Friedman computed.
DEBUG:Gismo:Processing Andrea_Araldo.
DEBUG:Gismo:Landmarks of Andrea_Araldo computed.
DEBUG:Gismo:Processing Andrea_Marcano.
DEBUG:Gismo:Landmarks of Andrea_Marcano computed.
DEBUG:Gismo:Processing Dimitrios_Milioris.
DEBUG:Gismo:Landmarks of Dimitrios_Milioris computed.
DEBUG:Gismo:Processing Elie_de_Panafieu.
DEBUG:Gismo:Landmarks of Elie_de_Panafieu computed.
DEBUG:Gismo:Processing Fabio_Pianese.
DEBUG:Gismo:Landmarks of Fabio_Pianese computed.
DEBUG:Gismo:Processing Francesca_Bassi.
DEBUG:Gismo:Landmarks of Francesca_Bassi computed.
DEBUG:Gismo:Processing François_Durand.
DEBUG:Gismo:Landmarks of François_Durand computed.
DEBUG:Gismo:Processing Joaquin_Garcia.
DEBUG:Gismo:Landmarks of Joaquin_Garcia computed.
DEBUG:Gismo:Processing Leonardo_Linguaglossa.
DEBUG:Gismo:Landmarks of Leonardo_Linguaglossa computed.
DEBUG:Gismo:Processing Lorenzo_Maggi.
DEBUG:Gismo:Landmarks of Lorenzo_Maggi computed.
DEBUG:Gismo:Processing Luis_Uzeda_Garcia.
DEBUG:Gismo:Landmarks of Luis_Uzeda_Garcia computed.
DEBUG:Gismo:Processing Marc-Olivier_Buob.
DEBUG:Gismo:Landmarks of Marc-Olivier_Buob computed.
DEBUG:Gismo:Processing Mauro_Sozio.
DEBUG:Gismo:Landmarks of Mauro_Sozio computed.
DEBUG:Gismo:Processing Natalya_Rozhnova.
DEBUG:Gismo:Landmarks of Natalya_Rozhnova computed.
DEBUG:Gismo:Processing Rémi_Varloot.
DEBUG:Gismo:Landmarks of Rémi_Varloot computed.
DEBUG:Gismo:Processing Sara_Ayoubi.
DEBUG:Gismo:Landmarks of Sara_Ayoubi computed.
DEBUG:Gismo:Processing Stefano_Paris.
DEBUG:Gismo:Landmarks of Stefano_Paris computed.
DEBUG:Gismo:Processing Tianzhu_Zhang.
DEBUG:Gismo:Landmarks of Tianzhu_Zhang computed.
DEBUG:Gismo:Processing Tijani_Chahed.
DEBUG:Gismo:Landmarks of Tijani_Chahed computed.
INFO:Gismo:All landmarks are built.
[21]:
print(f"Source length went down from {len(source)} to {len(reduced_source)}.")
Source length went down from 6232511 to 57267.

Instead of more than 6,000,000 all-purpose articles, we now have about 57,000 articles lying in the neighborhood of the considered researchers. We can now close the file descriptor, as we won’t need further access to the original source.

[22]:
source.close()
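
The reduced source lives in memory, so it remains available after the original file is closed. As a quick peek (a sketch, assuming the reduced source behaves like a plain list of article dicts, as the len() call above suggests), we can inspect one entry:

# Sketch: entries should have the same structure as the original DBLP dicts
# (type, authors, title, year, venue).
reduced_source[0]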

Building a small XGismo

Author Embedding

Author embedding takes a couple of seconds instead of a couple of minutes.

[23]:
reduced_corpus = Corpus(reduced_source, to_text=to_authors_text)
reduced_author_embedding = Embedding(vectorizer=vectorizer_author)
reduced_author_embedding.fit_transform(reduced_corpus)
C:\Users\fabienma\anaconda3\lib\site-packages\sklearn\feature_extraction\text.py:528: UserWarning: The parameter 'token_pattern' will not be used since 'tokenizer' is not None'
  warnings.warn(

Sanity Check

We can rebuild a small author Gismo. This part is merely a sanity check to verify that the reduction didn’t change things too much in the vicinity of the LINCS.

[24]:
reduced_gismo = Gismo(reduced_corpus, reduced_author_embedding)

Ranking is nearly instant.

[25]:
reduced_gismo.rank("Fabien_Mathieu")
[25]:
True

The results are almost identical to what was returned by the full Gismo.

[26]:
from gismo.post_processing import post_features_cluster_print
reduced_gismo.post_features_cluster = post_features_cluster_print
reduced_gismo.get_features_by_cluster()
 F: 0.02. R: 0.26. S: 0.71.
- F: 0.03. R: 0.25. S: 0.70.
-- F: 0.05. R: 0.24. S: 0.69.
--- F: 0.07. R: 0.19. S: 0.61.
---- F: 0.25. R: 0.18. S: 0.55.
----- F: 0.30. R: 0.18. S: 0.58.
------ F: 0.41. R: 0.15. S: 0.82.
------- Fabien_Mathieu (R: 0.11; S: 1.00)
------- F: 0.60. R: 0.04. S: 0.43.
-------- Laurent_Viennot (R: 0.02; S: 0.46)
-------- F: 0.69. R: 0.02. S: 0.36.
--------- Diego_Perino (R: 0.01; S: 0.41)
--------- Yacine_Boufkhad (R: 0.01; S: 0.29)
------ F: 0.77. R: 0.03. S: 0.36.
------- Julien_Reynier (R: 0.01; S: 0.37)
------- Fabien_de_Montgolfier (R: 0.01; S: 0.37)
------- Anh-Tuan_Gai (R: 0.01; S: 0.29)
----- Gheorghe_Postelnicu (R: 0.00; S: 0.19)
---- F: 0.33. R: 0.01. S: 0.29.
----- The_Dang_Huynh (R: 0.01; S: 0.28)
----- Dohy_Hong (R: 0.00; S: 0.18)
--- F: 0.39. R: 0.02. S: 0.34.
---- F: 0.58. R: 0.02. S: 0.33.
----- Céline_Comte (R: 0.01; S: 0.32)
----- Thomas_Bonald (R: 0.01; S: 0.24)
---- Anne_Bouillard (R: 0.00; S: 0.21)
--- F: 0.57. R: 0.02. S: 0.29.
---- F: 0.94. R: 0.02. S: 0.32.
----- François_Durand (R: 0.01; S: 0.32)
----- Ludovic_Noirie (R: 0.01; S: 0.34)
---- Benoît_Kloeckner (R: 0.00; S: 0.18)
--- Nidhi_Hegde_0001 (R: 0.00; S: 0.21)
-- F: 1.00. R: 0.01. S: 0.24.
--- Ilkka_Norros (R: 0.01; S: 0.24)
--- François_Baccelli (R: 0.00; S: 0.24)
- Mohamed_Bouklit (R: 0.01; S: 0.19)

Word Embedding

Now we build the word embedding, with the spacy add-on. This takes a couple of minutes instead of a couple of hours.

[27]:
import spacy
# Initialize spacy 'en' model, keeping only tagger component needed for lemmatization
nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])
# Who cares about DET and such?
keep = {'ADJ', 'NOUN', 'NUM', 'PROPN', 'SYM', 'VERB'}

preprocessor=lambda txt: " ".join([token.lemma_.lower() for token in nlp(txt)
                                   if token.pos_ in keep and not token.is_stop])
vectorizer_text = CountVectorizer(dtype=float, min_df=5, max_df=.02, ngram_range=(1, 3), preprocessor=preprocessor)
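
To get a feeling of what the preprocessor keeps, you can apply it to a sample title (a sketch; the exact result depends on the spacy model version):

# Sketch: lemmas of content words are kept, stop words and punctuation are dropped,
# so the title below should reduce to something like 'transition system transition'.
preprocessor("Transition systems without transitions.")
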
[28]:
reduced_corpus.to_text = lambda e: e['title']
reduced_word_embedding = Embedding(vectorizer=vectorizer_text)
reduced_word_embedding.fit_transform(reduced_corpus)

Gathering pieces together

We can combine the reduced embeddings to build an XGismo between authors and words.

[29]:
from gismo.gismo import XGismo
xgismo = XGismo(x_embedding=reduced_author_embedding, y_embedding=reduced_word_embedding)

We can save this for later use.

[31]:
xgismo.dump(filename="reduced_lincs_xgismo", path=path, overwrite=True)
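
In a later session, the dumped object can be reloaded instead of being rebuilt (a sketch, assuming XGismo exposes the same load mechanism used above for Embedding):

# Sketch: reload the saved XGismo from disk.
xgismo = XGismo.load(filename="reduced_lincs_xgismo", path=path)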

The file should be about 53 MB, whereas a full-size DBLP XGismo weighs about 4 GB. What about speed and quality of results?

[32]:
xgismo.rank("Anne_Bouillard", y=False)
[32]:
True
[33]:
xgismo.get_documents_by_rank()
[33]:
['Anne_Bouillard',
 'Paul_Nikolaus',
 'Jens_B._Schmitt',
 'Bruno_Gaujal',
 'Steffen_Bondorf',
 'Albert_Benveniste',
 'Sidney_Rosario',
 'Seyed_Mohammadhossein_Tabatabaee',
 'Fabien_Geyer',
 'Jean-Yves_Le_Boudec',
 'Stefan_Haar',
 'Ana_Busic',
 'Claude_Jard',
 'Giovanni_Stea',
 'Jean_Mairesse',
 'Eric_Thierry',
 'Nico_M._van_Dijk',
 'Sean_P._Meyn',
 'Zhen_Liu_0001',
 'Eitan_Altman',
 'François_Baccelli',
 'Aurore_Junier']

Let’s try a more elaborate display.

[34]:
from gismo.post_processing import post_documents_cluster_print, post_features_cluster_print
xgismo.parameters.distortion = 1.0
xgismo.post_documents_cluster = post_documents_cluster_print
xgismo.post_features_cluster = post_features_cluster_print
xgismo.get_documents_by_cluster()
 F: 0.03. R: 0.10. S: 0.76.
- F: 0.08. R: 0.08. S: 0.71.
-- F: 0.51. R: 0.07. S: 0.70.
--- F: 0.69. R: 0.06. S: 0.69.
---- F: 0.80. R: 0.06. S: 0.68.
----- F: 0.92. R: 0.06. S: 0.68.
------ Anne_Bouillard (R: 0.02; S: 0.80)
------ Paul_Nikolaus (R: 0.01; S: 0.68)
------ Jens_B._Schmitt (R: 0.01; S: 0.68)
------ Steffen_Bondorf (R: 0.00; S: 0.63)
------ Seyed_Mohammadhossein_Tabatabaee (R: 0.00; S: 0.63)
------ Fabien_Geyer (R: 0.00; S: 0.64)
------ Jean-Yves_Le_Boudec (R: 0.00; S: 0.68)
------ Eric_Thierry (R: 0.00; S: 0.69)
----- Bruno_Gaujal (R: 0.01; S: 0.72)
---- Giovanni_Stea (R: 0.00; S: 0.66)
--- Aurore_Junier (R: 0.00; S: 0.51)
-- F: 0.42. R: 0.02. S: 0.42.
--- F: 0.48. R: 0.01. S: 0.39.
---- Ana_Busic (R: 0.00; S: 0.33)
---- F: 0.52. R: 0.00. S: 0.28.
----- Nico_M._van_Dijk (R: 0.00; S: 0.23)
----- Zhen_Liu_0001 (R: 0.00; S: 0.31)
---- F: 0.73. R: 0.01. S: 0.34.
----- Sean_P._Meyn (R: 0.00; S: 0.25)
----- Eitan_Altman (R: 0.00; S: 0.42)
----- François_Baccelli (R: 0.00; S: 0.30)
--- Jean_Mairesse (R: 0.00; S: 0.31)
- F: 0.90. R: 0.01. S: 0.29.
-- F: 0.98. R: 0.01. S: 0.29.
--- Albert_Benveniste (R: 0.00; S: 0.28)
--- Sidney_Rosario (R: 0.00; S: 0.28)
--- Stefan_Haar (R: 0.00; S: 0.32)
-- Claude_Jard (R: 0.00; S: 0.32)
[35]:
xgismo.get_features_by_cluster(target_k=1.5, resolution=.5, distortion=.5)
 F: 0.35. R: 0.21. S: 0.91.
- F: 0.49. R: 0.17. S: 0.88.
-- F: 0.75. R: 0.13. S: 0.84.
--- network calculus (R: 0.05; S: 0.87)
--- calculus (R: 0.04; S: 0.87)
--- stochastic network calculus (R: 0.02; S: 0.76)
--- stochastic network (R: 0.01; S: 0.72)
--- multiplexing (R: 0.01; S: 0.80)
-- free choice (R: 0.01; S: 0.72)
-- ospf (R: 0.01; S: 0.56)
-- end (R: 0.01; S: 0.63)
- orchestrations (R: 0.03; S: 0.51)
- stochastic (R: 0.01; S: 0.51)

Rebuild landmarks

Lincs landmarks

We can rebuild Lincs landmarks on the XGismo.

[36]:
lincs_landmarks = Landmarks(source=lincs, to_text=lambda x: x['dblp'],
                            rank=lambda g, q: g.rank(q, y=False))
lincs_landmarks.fit(xgismo)
INFO:Gismo:Start computation of 41 landmarks.
DEBUG:Gismo:Processing Alonso_Silva.
DEBUG:Gismo:Landmarks of Alonso_Silva computed.
DEBUG:Gismo:Processing Ana_Busic.
DEBUG:Gismo:Landmarks of Ana_Busic computed.
DEBUG:Gismo:Processing Anaïs_Vergne.
DEBUG:Gismo:Landmarks of Anaïs_Vergne computed.
DEBUG:Gismo:Processing Bartek_Blaszczyszyn.
DEBUG:Gismo:Landmarks of Bartek_Blaszczyszyn computed.
DEBUG:Gismo:Processing Chung_Shue_Chen.
DEBUG:Gismo:Landmarks of Chung_Shue_Chen computed.
DEBUG:Gismo:Processing Daniel_Kofman.
DEBUG:Gismo:Landmarks of Daniel_Kofman computed.
DEBUG:Gismo:Processing Eitan_Altman.
DEBUG:Gismo:Landmarks of Eitan_Altman computed.
DEBUG:Gismo:Processing François_Baccelli.
DEBUG:Gismo:Landmarks of François_Baccelli computed.
DEBUG:Gismo:Processing Laurent_Decreusefond.
DEBUG:Gismo:Landmarks of Laurent_Decreusefond computed.
DEBUG:Gismo:Processing Ludovic_Noirie.
DEBUG:Gismo:Landmarks of Ludovic_Noirie computed.
DEBUG:Gismo:Processing Makhlouf_Hadji.
DEBUG:Gismo:Landmarks of Makhlouf_Hadji computed.
DEBUG:Gismo:Processing Marc_Lelarge.
DEBUG:Gismo:Landmarks of Marc_Lelarge computed.
DEBUG:Gismo:Processing Marceau_Coupechoux.
DEBUG:Gismo:Landmarks of Marceau_Coupechoux computed.
DEBUG:Gismo:Processing Maria_Potop-Butucaru.
DEBUG:Gismo:Landmarks of Maria_Potop-Butucaru computed.
DEBUG:Gismo:Processing Petr_Kuznetsov.
DEBUG:Gismo:Landmarks of Petr_Kuznetsov computed.
DEBUG:Gismo:Processing Philippe_Jacquet.
DEBUG:Gismo:Landmarks of Philippe_Jacquet computed.
DEBUG:Gismo:Processing Philippe_Martins.
DEBUG:Gismo:Landmarks of Philippe_Martins computed.
DEBUG:Gismo:Processing Renata_Teixeira.
DEBUG:Gismo:Landmarks of Renata_Teixeira computed.
DEBUG:Gismo:Processing Serge_Fdida.
DEBUG:Gismo:Landmarks of Serge_Fdida computed.
DEBUG:Gismo:Processing Sébastien_Tixeuil.
DEBUG:Gismo:Landmarks of Sébastien_Tixeuil computed.
DEBUG:Gismo:Processing Thomas_Bonald.
DEBUG:Gismo:Landmarks of Thomas_Bonald computed.
DEBUG:Gismo:Processing Timur_Friedman.
DEBUG:Gismo:Landmarks of Timur_Friedman computed.
DEBUG:Gismo:Processing Andrea_Araldo.
DEBUG:Gismo:Landmarks of Andrea_Araldo computed.
DEBUG:Gismo:Processing Andrea_Marcano.
DEBUG:Gismo:Landmarks of Andrea_Marcano computed.
DEBUG:Gismo:Processing Dimitrios_Milioris.
DEBUG:Gismo:Landmarks of Dimitrios_Milioris computed.
DEBUG:Gismo:Processing Elie_de_Panafieu.
DEBUG:Gismo:Landmarks of Elie_de_Panafieu computed.
DEBUG:Gismo:Processing Fabio_Pianese.
DEBUG:Gismo:Landmarks of Fabio_Pianese computed.
DEBUG:Gismo:Processing Francesca_Bassi.
DEBUG:Gismo:Landmarks of Francesca_Bassi computed.
DEBUG:Gismo:Processing François_Durand.
DEBUG:Gismo:Landmarks of François_Durand computed.
DEBUG:Gismo:Processing Joaquin_Garcia.
DEBUG:Gismo:Landmarks of Joaquin_Garcia computed.
DEBUG:Gismo:Processing Leonardo_Linguaglossa.
DEBUG:Gismo:Landmarks of Leonardo_Linguaglossa computed.
DEBUG:Gismo:Processing Lorenzo_Maggi.
DEBUG:Gismo:Landmarks of Lorenzo_Maggi computed.
DEBUG:Gismo:Processing Luis_Uzeda_Garcia.
DEBUG:Gismo:Landmarks of Luis_Uzeda_Garcia computed.
DEBUG:Gismo:Processing Marc-Olivier_Buob.
DEBUG:Gismo:Landmarks of Marc-Olivier_Buob computed.
DEBUG:Gismo:Processing Mauro_Sozio.
DEBUG:Gismo:Landmarks of Mauro_Sozio computed.
DEBUG:Gismo:Processing Natalya_Rozhnova.
DEBUG:Gismo:Landmarks of Natalya_Rozhnova computed.
DEBUG:Gismo:Processing Rémi_Varloot.
DEBUG:Gismo:Landmarks of Rémi_Varloot computed.
DEBUG:Gismo:Processing Sara_Ayoubi.
DEBUG:Gismo:Landmarks of Sara_Ayoubi computed.
DEBUG:Gismo:Processing Stefano_Paris.
DEBUG:Gismo:Landmarks of Stefano_Paris computed.
DEBUG:Gismo:Processing Tianzhu_Zhang.
DEBUG:Gismo:Landmarks of Tianzhu_Zhang computed.
DEBUG:Gismo:Processing Tijani_Chahed.
DEBUG:Gismo:Landmarks of Tijani_Chahed computed.
INFO:Gismo:All landmarks are built.

We can extract the Lincs researchers that are the most similar to a given researcher (not necessarily from the Lincs).

[37]:
xgismo.rank("Anne_Bouillard", y=False)
lincs_landmarks.post_item = lambda l, i: l[i]['name']
lincs_landmarks.get_landmarks_by_rank(xgismo)
[37]:
['Elie de Panafieu',
 'Ana Bušić',
 'Remi Varloot',
 'Marc-Olivier Buob',
 'François Baccelli',
 'Eitan Altman',
 'Philippe Jacquet',
 'Alonso Silva',
 'Marc Lelarge',
 'Dimitrios Milioris',
 'Tijani Chahed',
 'Bartek Blaszczyszyn',
 'Thomas Bonald']

We can also organize the results in clusters.

[38]:
xgismo.rank("Anne_Bouillard", y=False)
from gismo.post_processing import post_landmarks_cluster_print
lincs_landmarks.post_cluster = post_landmarks_cluster_print
lincs_landmarks.get_landmarks_by_cluster(xgismo, balance=.5, target_k=1.2)
 F: 0.25.
- Elie de Panafieu
- F: 0.35.
-- F: 0.36.
--- Ana Bušić
--- Remi Varloot
--- Marc-Olivier Buob
-- F: 0.53.
--- François Baccelli
--- Eitan Altman
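
The same machinery accepts keyword queries on the word side; here is a minimal sketch ("network calculus" is just an example query, output not shown):

# Sketch: rank xgismo against a keyword query (the word side is the default),
# then list the closest Lincs landmarks.
xgismo.rank("network calculus")
lincs_landmarks.get_landmarks_by_rank(xgismo)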

ACM landmarks

We can build other landmarks using the ACM categories. This will enable us to describe things in terms of categories.

[39]:
from gismo.datasets.acm import get_acm, flatten_acm
acm = flatten_acm(get_acm(), min_size=10)
[40]:
acm_landmarks = Landmarks(acm, to_text=lambda e: e['query'])
[41]:
log.setLevel(logging.INFO)
acm_landmarks.fit(xgismo)
INFO:Gismo:Start computation of 113 landmarks.
INFO:Gismo:All landmarks are built.
[42]:
xgismo.rank("Fabien_Mathieu", y=False)
acm_landmarks.post_item = lambda l, i: l[i]['name']
acm_landmarks.get_landmarks_by_rank(xgismo, balance=.5, target_k=1.2)
[42]:
['Graph theory',
 'Discrete mathematics',
 'Contextual software domains',
 'Software organization and properties',
 'Theory of computation',
 'Architectures',
 'Software system structures',
 'Models of computation']
[43]:
xgismo.rank("combinatorics")
acm_landmarks.post_cluster = post_landmarks_cluster_print
acm_landmarks.get_landmarks_by_cluster(xgismo, balance=.5, target_k=1.5)
 F: 0.65.
- F: 0.96.
-- Discrete mathematics
-- Graph theory
- F: 0.74.
-- F: 0.96.
--- Symbolic and algebraic algorithms
--- Symbolic and algebraic manipulation
--- Numerical analysis
--- Mathematical analysis
-- F: 0.92.
--- Mathematics of computing
--- Data structures
--- Design and analysis of algorithms
--- Models of computation
--- Theory of computation

Note that we fully ignore the original ACM category hierarchy. Instead, Gismo builds its own hierarchy tailored to the query.

Combining landmarks

Through the post_processing methods, we can intertwine multiple landmarks. For example, the following method associates Lincs researchers and keywords with a tree of ACM categories.

[44]:
from gismo.common import auto_k
import numpy as np
def post_cluster_acm(l, cluster, depth=0, kw_size=.3, mts_size=.5):
    tk_kw = 1/kw_size
    tk_mts = 1/mts_size
    n = l.x_direction.shape[0]

    kws_view = cluster.vector[0, n:]
    k = auto_k(data=kws_view.data, max_k=100, target=tk_kw)
    keywords = [xgismo.embedding.features[i]
                for i in kws_view.indices[np.argsort(-kws_view.data)[:k]]]

    if len(cluster.children) > 0:
        print(f"|{'-'*depth}")
        for c in cluster.children:
            post_cluster_acm(l, c, depth=depth+1)
    else:
        domain = l[cluster.indice]['name']
        researchers = ", ".join(lincs_landmarks.get_landmarks_by_rank(
            cluster, target_k=tk_mts, distortion=0.5))
        print(f"|{'-'*depth} {domain} ({researchers}) ({', '.join(keywords)})")

[45]:
xgismo.rank("combinatorics")
acm_landmarks.post_cluster = post_cluster_acm
acm_landmarks.get_landmarks_by_cluster(xgismo, target_k=1.5)
|
|-
|-- Discrete mathematics (Elie de Panafieu, Philippe Jacquet) (graph, combinatoric, analytic, random, analytic combinatoric)
|-- Graph theory (Elie de Panafieu, Philippe Jacquet) (graph, random, complexity, analytic)
|-
|--
|--- Symbolic and algebraic algorithms (Elie de Panafieu, Philippe Jacquet) (analytic, complexity, random, functions, graph)
|--- Symbolic and algebraic manipulation (Elie de Panafieu, Philippe Jacquet) (analytic, complexity, random, graph, functions, function)
|--- Numerical analysis (Elie de Panafieu, Philippe Jacquet) (complexity, gröbner basis, basis, gröbner, combinatoric, graph)
|--- Mathematical analysis (Elie de Panafieu, Philippe Jacquet) (complexity, graph, random, analytic, basis, gröbner basis, gröbner, combinatoric)
|--
|--- Mathematics of computing (Elie de Panafieu, Philippe Jacquet) (graph, complexity, random, analytic, combinatoric)
|--- Data structures (Philippe Jacquet, Elie de Panafieu) (graph, complexity, analytic, structure, random, datum)
|--- Design and analysis of algorithms (Philippe Jacquet, Elie de Panafieu) (graph, complexity, random, analytic)
|--- Models of computation (Philippe Jacquet, Elie de Panafieu) (complexity, graph, random, analytic)
|--- Theory of computation (Philippe Jacquet, Elie de Panafieu) (graph, complexity, random, analytic)

Conversely, one can associate ACM categories and keywords with a tree of Lincs researchers.

[46]:
def post_cluster_lincs(l, cluster, depth=0, kw_size=.3, acm_size=.5):
    tk_kw = 1/kw_size
    tk_acm = 1/acm_size
    n = l.x_direction.shape[0]

    kws_view = cluster.vector[0, n:]
    k = auto_k(data=kws_view.data, max_k=100, target=tk_kw)
    keywords = [xgismo.embedding.features[i]
                for i in kws_view.indices[np.argsort(-kws_view.data)[:k]]]

    if len(cluster.children) > 0:
        print(f"|{'-'*depth}")
        for c in cluster.children:
            post_cluster_lincs(l, c, depth=depth+1)
    else:
        researcher = l[cluster.indice]['name']
        domains = ", ".join(acm_landmarks.get_landmarks_by_rank(
            cluster, target_k=tk_acm, distortion=0.5))
        print(f"|{'-'*depth} {researcher} ({domains}) ({', '.join(keywords)})")
[47]:
xgismo.rank("Anne_Bouillard", y=False)
lincs_landmarks.post_cluster = post_cluster_lincs
lincs_landmarks.get_landmarks_by_cluster(xgismo, target_k=1.4)
|
|- Elie de Panafieu (Symbolic and algebraic algorithms, Symbolic and algebraic manipulation, Discrete mathematics, Models of computation, Mathematical analysis, Mathematics of computing, Graph theory, Design and analysis of algorithms, Theory of computation) (network calculus, calculus, analytic, combinatoric, analytic combinatoric, kernels)
|-
|-- Ana Bušić (Mathematical analysis, Machine learning algorithms, Mathematics of computing, Symbolic and algebraic algorithms, Symbolic and algebraic manipulation) (stochastic, exact, queue network, perfect, perfect sample, sampling, queue, matching, sample)
|-- Remi Varloot (Symbolic and algebraic algorithms, Symbolic and algebraic manipulation, Discrete mathematics, Graph theory, Models of computation, Mathematical analysis, Design and analysis of algorithms, Mathematics of computing) (dynamics, glauber dynamics, glauber, random generation, independent sets, sets, independent, speed, generation)
|-- Marc-Olivier Buob (Symbolic and algebraic algorithms) (space time, pattern, alarm, space, optimal routing, end, log, pattern matching, delays, logs, lightweight)
|-- François Baccelli (Computer graphics) (stochastic, queue, end, delay, throughput)

That’s all for this tutorial!