DBLP exploration

This tutorial shows how to explore DBLP with Gismo.

If you have never used Gismo before, you may want to start with the Toy example tutorial or the ACM tutorial.

Note: the DBLP database is not small. Recommended requirements to execute this Notebook:
- Fast Internet connection (you will need to download a few hundred MB);
- 4 GB of free space;
- 4 GB of RAM (8 GB or more recommended);
- Decent CPU (the computation can take more than one hour on slow CPUs).

Here, documents are DBLP articles. The features associated with an article will vary along the tutorial: words from the title first, then authors.

Initialisation

First, we load the required packages.

[1]:
import numpy as np
import spacy
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from pathlib import Path
from functools import partial

from gismo.datasets.dblp import Dblp
from gismo.filesource import FileSource
from gismo.corpus import Corpus
from gismo.embedding import Embedding
from gismo.gismo import Gismo
from gismo.post_processing import post_features_cluster_print, post_documents_cluster_print

Then, we prepare the DBLP source.

First we choose the location of the DBLP files. If you want to run this Notebook at your place, please change the path below and check that it exists.

[2]:
path = Path("../../../../../../Datasets/DBLP")
path.exists()
[2]:
True

We then construct the dblp files. This only needs to be performed the first time, or when you want to refresh the database. It takes about 10 minutes on a Surface Pro 4 with a fiber Internet connection.

[3]:
dblp = Dblp(path=path)
dblp.build()
File ..\..\..\..\..\..\Datasets\DBLP\dblp.xml.gz already exists. Use refresh option to overwrite.
File ..\..\..\..\..\..\Datasets\DBLP\dblp.data already exists. Use refresh option to overwrite.

Then, we can load the database as a filesource.

[4]:
source = FileSource(filename="dblp", path=path)
source[0]
[4]:
{'type': 'article',
 'authors': ['Paul Kocher',
  'Daniel Genkin',
  'Daniel Gruss',
  'Werner Haas 0004',
  'Mike Hamburg',
  'Moritz Lipp',
  'Stefan Mangard',
  'Thomas Prescher 0002',
  'Michael Schwarz 0001',
  'Yuval Yarom'],
 'title': 'Spectre Attacks: Exploiting Speculative Execution.',
 'venue': 'meltdownattack.com',
 'year': '2018'}

Each article is a dict with fields type, venue, title, year, and authors. We build a corpus that will tell Gismo that the content of an article is its title value.

[5]:
corpus = Corpus(source, to_text=lambda x: x['title'])

We build an embedding on top of that corpus.
- We set min_df=30 to exclude rare features;
- We set max_df=.02 to exclude anything present in more than 2% of the corpus;
- We use spacy to lemmatize and remove some stopwords; remove preprocessor=... from the input if you want to skip this step (it takes time);
- A few manually selected stopwords fine-tune the result;
- We set ngram_range=(1, 2) to include bi-grams in the embedding.

This will take a few minutes (without spacy) up to a few hours (with spacy enabled). You can save the embedding for later if you want.
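The effect of min_df and ngram_range can be previewed on a tiny toy corpus (an illustrative sketch, unrelated to DBLP):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Three toy documents to preview the vectorizer parameters used above.
docs = ["machine learning rocks", "machine learning is hard", "graphs are fun"]
# min_df=2 keeps only terms present in at least 2 documents;
# ngram_range=(1, 2) indexes unigrams and bigrams.
vec = CountVectorizer(min_df=2, ngram_range=(1, 2))
vec.fit(docs)
print(sorted(vec.vocabulary_))  # ['learning', 'machine', 'machine learning']
```

Only "machine", "learning", and the bigram "machine learning" appear in two documents; everything else is filtered out by min_df.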

[6]:
nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])
keep = {'ADJ', 'NOUN', 'NUM', 'PROPN', 'SYM', 'VERB'}
vectorizer = CountVectorizer(min_df=30, max_df=.02, ngram_range=(1, 2), dtype=float,
                             preprocessor=lambda txt: " ".join([w.lemma_.lower() for w in nlp(txt)
                                                                if w.pos_ in keep and not w.is_stop]),
                             stop_words=['a', 'about', 'an', 'and', 'for', 'from', 'in', 'of', 'on', 'the', 'with'])

try:
    embedding = Embedding.load(filename="dblp_embedding", path=path)
except Exception:
    embedding = Embedding(vectorizer=vectorizer)
    embedding.fit_transform(corpus)
    embedding.dump(filename="dblp_embedding", path=path)
[7]:
embedding.x
[7]:
<6232511x192946 sparse matrix of type '<class 'numpy.float64'>'
        with 61239343 stored elements in Compressed Sparse Row format>

We see from embedding.x that the embedding links about 6,200,000 documents to 193,000 features. On average, each document is linked to about 10 features.
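This average can be read directly off a sparse matrix: divide the number of stored elements by the number of rows (here, 61,239,343 / 6,232,511 ≈ 9.8). A self-contained sketch on a toy matrix:

```python
from scipy.sparse import csr_matrix

# Toy stand-in for embedding.x: rows are documents, columns are features.
x = csr_matrix([[1.0, 0.0, 2.0],
                [0.0, 3.0, 0.0]])
# nnz counts stored elements; shape[0] counts documents (rows).
print(x.nnz / x.shape[0])  # average number of features per document -> 1.5
```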

Now, we initialize the gismo object and customize its post-processors to ease the display.

[8]:
gismo = Gismo(corpus, embedding)
[9]:
def post_article(g, i):
    dic = g.corpus[i]
    authors = ", ".join(dic['authors'])
    return f"{dic['title']} By {authors} ({dic['venue']}, {dic['year']})"

gismo.post_documents_item = post_article

def post_title(g, i):
    return g.corpus[i]['title']

def post_meta(g, i):
    dic = g.corpus[i]
    authors = ", ".join(dic['authors'])
    return f"{authors} ({dic['venue']}, {dic['year']})"


gismo.post_documents_cluster = partial(post_documents_cluster_print, post_item=post_title)
gismo.post_features_cluster = post_features_cluster_print

As the dataset is big, we lower the precision of the computation to speed things up a little.

[10]:
gismo.parameters.n_iter = 2

Machine Learning (and Covid-19) query

We perform the query Machine learning. The returned True indicates that at least some of the query features were found among the corpus’ features.

[11]:
gismo.rank("Machine Learning")
[11]:
True

What are the best articles on Machine Learning?

[12]:
gismo.get_documents_by_rank()
[12]:
['Resilient Machine Learning for Networked Cyber Physical Systems: A Survey for Machine Learning Security to Securing Machine Learning for CPS. By Felix O. Olowononi, Danda B. Rawat, Chunmei Liu (IEEE Commun. Surv. Tutorials, 2021)',
 'Resilient Machine Learning for Networked Cyber Physical Systems: A Survey for Machine Learning Security to Securing Machine Learning for CPS. By Felix O. Olowononi, Danda B. Rawat, Chunmei Liu (CoRR, 2021)',
 'The Machine Learning Machine: A Tangible User Interface for Teaching Machine Learning. By Magnus Høholt Kaspersen, Karl-Emil Kjær Bilstrup, Marianne Graves Petersen (TEI, 2021)',
 'Resilient Machine Learning (rML) Ensemble Against Adversarial Machine Learning Attacks. By Likai Yao, Cihan Tunc, Pratik Satam, Salim Hariri (DDDAS, 2020)',
 'A Machine Learning Tutorial for Operational Meteorology, Part I: Traditional Machine Learning. By Randy J. Chase, David R. Harrison, Amanda Burke, Gary M. Lackmann, Amy McGovern (CoRR, 2022)',
 'Critical Tools for Machine Learning: Situating, Figuring, Diffracting, Fabulating Machine Learning Systems Design. By Goda Klumbyte, Claude Draude, Alex S. Taylor (CHItaly, 2021)',
 'Poisoning Attacks Against Machine Learning: Can Machine Learning Be Trustworthy? By Alina Oprea, Anoop Singhal, Apostol Vassilev 0001 (Computer, 2022)',
 'When Physics Meets Machine Learning: A Survey of Physics-Informed Machine Learning. By Chuizheng Meng, Sungyong Seo, Defu Cao, Sam Griesemer, Yan Liu (CoRR, 2022)',
 'Predicting Machine Learning Pipeline Runtimes in the Context of Automated Machine Learning. By Felix Mohr, Marcel Wever, Alexander Tornede, Eyke Hüllermeier (IEEE Trans. Pattern Anal. Mach. Intell., 2021)',
 'Adversarial Machine Learning: Difficulties in Applying Machine Learning to Existing Cybersecurity Systems. By Nick Rahimi, Jordan Maynor, Bidyut Gupta (CATA, 2020)',
 'Hacking Machine Learning: Towards The Comprehensive Taxonomy of Attacks Against Machine Learning Systems. By Jerzy Surma (ICIAI, 2020)',
 'Machine Learning Unplugged - Development and Evaluation of a Workshop About Machine Learning. By Elisaweta Ossovski, Michael Brinkmeier (ISSEP, 2019)',
 'Can machine learning model with static features be fooled: an adversarial machine learning approach. By Rahim Taheri, Reza Javidan, Mohammad Shojafar, P. Vinod 0001, Mauro Conti (Clust. Comput., 2020)',
 'Machine Learning for Health ( ML4H ) 2019 : What Makes Machine Learning in Medicine Different? By Adrian V. Dalca, Matthew B. A. McDermott, Emily Alsentzer, Samuel G. Finlayson, Michael Oberst, Fabian Falck, Corey Chivers, Andrew Beam, Tristan Naumann, Brett K. Beaulieu-Jones (ML4H@NeurIPS, 2019)',
 'Ensemble Machine Learning Methods to Predict the Balancing of Ayurvedic Constituents in the Human Body Ensemble Machine Learning Methods to Predict. By Vani Rajasekar, Sathya Krishnamoorthi, Muzafer Saracevic, Dzenis Pepic, Mahir Zajmovic, Haris Zogic (Comput. Sci., 2022)',
 'Special session on machine learning: How will machine learning transform test? By Yiorgos Makris, Amit Nahar, Haralampos-G. D. Stratigopoulos, Marc Hutner (VTS, 2018)',
 'Informed Machine Learning - Towards a Taxonomy of Explicit Integration of Knowledge into Machine Learning. By Laura von Rüden, Sebastian Mayer, Jochen Garcke, Christian Bauckhage, Jannis Schücker (CoRR, 2019)',
 'Machine Learning in Antenna Design: An Overview on Machine Learning Concept and Algorithms. By Hilal M. El Misilmani, Tarek Naous (HPCS, 2019)',
 'Can Machine Learning Model with Static Features be Fooled: an Adversarial Machine Learning Approach. By Rahim Taheri, Reza Javidan, Mohammad Shojafar, Vinod P 0001, Mauro Conti (CoRR, 2019)',
 'Physics Informed Machine Learning of SPH: Machine Learning Lagrangian Turbulence. By Michael Woodward, Yifeng Tian, Criston Hyett, Chris Fryer, Daniel Livescu, Mikhail Stepanov, Michael Chertkov (CoRR, 2021)',
 'Targeted Machine Learning: how we can use machine learning for causal inference. By Antoine Chambaz (EGC, 2021)',
 'Arabic Offensive Language Detection Using Machine Learning and Ensemble Machine Learning Approaches. By Fatemah Husain (CoRR, 2020)',
 'How Developers Iterate on Machine Learning Workflows - A Survey of the Applied Machine Learning Literature. By Doris Xin, Litian Ma, Shuchen Song, Aditya G. Parameswaran (CoRR, 2018)',
 'Declarative Machine Learning Systems: The future of machine learning will depend on it being in the hands of the rest of us. By Piero Molino, Christopher Ré (ACM Queue, 2021)',
 'MARVIN: An Open Machine Learning Corpus and Environment for Automated Machine Learning Primitive Annotation and Execution. By Chris A. Mattmann, Sujen Shah, Brian Wilson (CoRR, 2018)',
 'Securing Machine Learning in the Cloud: A Systematic Review of Cloud Machine Learning Security. By Adnan Qayyum, Aneeqa Ijaz, Muhammad Usama, Waleed Iqbal, Junaid Qadir 0001, Yehia Elkhatib, Ala I. Al-Fuqaha (Frontiers Big Data, 2020)',
 'How to Conduct Rigorous Supervised Machine Learning in Information Systems Research: The Supervised Machine Learning Report Card. By Niklas Kühl, Robin Hirt, Lucas Baier, Björn Schmitz, Gerhard Satzger (Commun. Assoc. Inf. Syst., 2021)',
 "Exploring the Effects of Machine Learning Literacy Interventions on Laypeople's Reliance on Machine Learning Models. By Chun-Wei Chiang, Ming Yin (IUI, 2022)",
 'Machine Learning in Space: A Review of Machine Learning Algorithms and Hardware for Space Applications. By James Murphy, John E. Ward, Brian Mac Namee (AICS, 2021)',
 'Critical Tools for Machine Learning: Working with Intersectional Critical Concepts in Machine Learning Systems Design. By Goda Klumbyte, Claude Draude, Alex S. Taylor (FAccT, 2022)',
 'BetaML: The Beta Machine Learning Toolkit, a self-contained repository of Machine Learning algorithms in Julia. By Antonello Lobianco (J. Open Source Softw., 2021)',
 'Can machine learning consistently improve the scoring power of classical scoring functions? Insights into the role of machine learning in scoring functions. By Chao Shen, Ye Hu, Zhe Wang 0041, Xujun Zhang, Haiyang Zhong, Gaoang Wang, Xiaojun Yao, Lei Xu 0035, Dong-Sheng Cao 0001, Tingjun Hou (Briefings Bioinform., 2021)',
 'Three Differential Emotion Classification by Machine Learning Algorithms using Physiological Signals - Discriminantion of Emotions by Machine Learning Algorithms. By Eun-Hye Jang, Byoung-Jun Park, Sang-Hyeob Kim, Jin-Hun Sohn (ICAART (1), 2012)',
 'The Silent Problem - Machine Learning Model Failure - How to Diagnose and Fix Ailing Machine Learning Models. By Michele Bennett, Jaya Balusu, Karin Hayes, Ewa J. Kleczyk (CoRR, 2022)',
 'Trustless Machine Learning Contracts; Evaluating and Exchanging Machine Learning Models on the Ethereum Blockchain. By A. Besir Kurtulmus, Kenny Daniel (CoRR, 2018)',
 'Machine Learning Against Cancer: Accurate Diagnosis of Cancer by Machine Learning Classification of the Whole Genome Sequencing Data. By Arash Hooshmand (CoRR, 2020)',
 'Machine Learning in Python: Main Developments and Technology Trends in Data Science, Machine Learning, and Artificial Intelligence. By Sebastian Raschka, Joshua Patterson, Corey Nolet (Inf., 2020)',
 'Machine Learning in Python: Main developments and technology trends in data science, machine learning, and artificial intelligence. By Sebastian Raschka, Joshua Patterson, Corey Nolet (CoRR, 2020)',
 'Teaching Machine Learning to Computer Science Preservice Teachers: Human vs. Machine Learning. By Koby Mike, Rinat B. Rosenberg-Kima (SIGCSE, 2021)',
 'Linear Algebra and Optimization with Applications to Machine Learning - Volume II: Fundamentals of Optimization Theory with Applications to Machine Learning By Jean H. Gallier, Jocelyn Quaintance (Linear Algebra and Optimization with Applications to Machine Learning, Volume II, 2020)',
 'Machine Learning in Textual Criticism: An examination of the performance of supervised machine learning algorithms in reconstructing the text of the Greek New Testament. By Mason Jones, Francesco Romano, Abidalrahman Mohd (ICMLT, 2022)',
 'Linear Algebra and Optimization with Applications to Machine Learning - Volume I: Linear Algebra for Computer Vision, Robotics, and Machine Learning By Jean H. Gallier, Jocelyn Quaintance (Linear Algebra and Optimization with Applications to Machine Learning, Volume I, 2020)',
 '"I Never Thought About Securing My Machine Learning Systems": A Study of Security and Privacy Awareness of Machine Learning Practitioners. By Franziska Boenisch, Verena Battis, Nicolas Buchmann, Maija Poikela (MuC, 2021)',
 'Métodos de Machine Learning Aplicados no Cenário da Educaçáo a Distância Brasileira (Machine Learning Techniques Applied to the Brazilian Distance Education). By Charles Nicollas C. Freitas, Roberta M. M. Gouveia, Rodrigo G. F. Soares (SIIE, 2020)',
 'When Lempel-Ziv-Welch Meets Machine Learning: A Case Study of Accelerating Machine Learning using Coding. By Fengan Li, Lingjiao Chen, Arun Kumar 0001, Jeffrey F. Naughton, Jignesh M. Patel, Xi Wu 0001 (CoRR, 2017)',
 'Machine Learning approach to Secure Software Defined Network: Machine Learning and Artificial Intelligence. By Afaf D. Althobiti, Rabab M. Almohayawi, Omaimah Bamasag (ICFNDS, 2020)',
 'Preserving User Privacy for Machine Learning: Local Differential Privacy or Federated Machine Learning? By Huadi Zheng, Haibo Hu 0001, Ziyang Han (IEEE Intell. Syst., 2020)',
 "Machine Learning Explanations as Boundary Objects: How AI Researchers Explain and Non-Experts Perceive Machine Learning. By Amid Ayobi, Katarzyna Stawarz, Dmitri S. Katz, Paul Marshall, Taku Yamagata, Raúl Santos-Rodríguez, Peter A. Flach, Aisling Ann O'Kane (IUI Workshops, 2021)",
 'The Holy Grail of "Systems for Machine Learning": Teaming humans and machine learning for detecting cyber threats. By Ignacio Arnaldo, Kalyan Veeramachaneni (SIGKDD Explor., 2019)',
 'Sports and machine learning: How young people can use data from their own bodies to learn about machine learning. By Abigail Zimmermann-Niefield, R. Benjamin Shapiro, Shaun K. Kane (XRDS, 2019)',
 'Tracking machine learning models for pandemic scenarios: a systematic review of machine learning models that predict local and global evolution of pandemics. By Marcelo Benedeti Palermo, Lucas Micol Policarpo, Cristiano André da Costa, Rodrigo da Rosa Righi (Netw. Model. Anal. Health Informatics Bioinform., 2022)',
 'Current Advances, Trends and Challenges of Machine Learning and Knowledge Extraction: From Machine Learning to Explainable AI. By Andreas Holzinger, Peter Kieseberg, Edgar R. Weippl, A Min Tjoa (CD-MAKE, 2018)',
 'Modelling of Received Signals in Molecular Communication Systems based machine learning: Comparison of azure machine learning and Python tools. By Soha Mohamed, Mahmoud S. Fayed (CoRR, 2021)',
 'High Value Media Monitoring With Machine Learning - Using Machine Learning to Drive Cost Effectiveness in an Established Business. By Matti Lyra, Daoud Clarke, Hamish Morgan, Jeremy Reffin, David J. Weir (Künstliche Intell., 2013)',
 'BPMN4sML: A BPMN Extension for Serverless Machine Learning. Technology Independent and Interoperable Modeling of Machine Learning Workflows and their Serverless Deployment Orchestration. By Laurens Martin Tetzlaff (CoRR, 2022)',
 'Do we need different machine learning algorithms for QSAR modeling? A comprehensive assessment of 16 machine learning algorithms on 14 QSAR data sets. By Zhenxing Wu, Minfeng Zhu, Yu Kang 0002, Elaine Lai-Han Leung, Tailong Lei, Chao Shen, Dejun Jiang 0002, Zhe Wang 0041, Dong-Sheng Cao 0001, Tingjun Hou (Briefings Bioinform., 2021)',
 'Ethem Alpaydin. Introduction to Machine Learning (Adaptive Computation and Machine Learning Series). The MIT Press, 2004, ISBN 0 262 01211 1. By Shahzad Khan (Nat. Lang. Eng., 2008)',
 'Bagging Machine Learning Algorithms: A Generic Computing Framework Based on Machine-Learning Methods for Regional Rainfall Forecasting in Upstate New York. By Ning Yu, Timothy Haskins (Informatics, 2021)',
 'Artificial Intelligence, Machine Learning, and Signal Processing: Researchers are using artificial intelligence, machine learning, and signal processing to build powerful three-level platforms to help meet project goals [Special Reports]. By John Edwards (IEEE Signal Process. Mag., 2021)',
 'Automated Essay Scoring (AES); A Semantic Analysis Inspired Machine Learning Approach: An automated essay scoring system using semantic analysis and machine learning is presented in this research. By Ahsan Ikram, Billy Castle (ICETC, 2020)',
 'Digital Phenotyping and Machine Learning in the Next Generation of Digital Health Technologies: Utilising Event Logging, Ecological Momentary Assessment & Machine Learning. By Maurice D. Mulvenna (ICT4AWE, 2020)',
 'Can Machine Learning Correct Commonly Accepted Knowledge and Provide Understandable Knowledge in Care Support Domain? Tackling Cognitive Bias and Humanity from Machine Learning Perspective. By Keiki Takadama (AAAI Spring Symposia, 2018)',
 'Brain Tumor Classification using Machine Learning and Deep Learning Algorithms: A Comparison: Classifying brain MRI images on thebasis of location of tumor and comparingthe various Machine Learning and Deep LEARNING models used to predict best performance. By Ananya Joshi, Vipasha Rana, Aman Sharma (IC3, 2022)',
 'Motion Evaluation of Therapy Exercises by Means of Skeleton Normalisation, Incremental Dynamic Time Warping and Machine Learning: A Comparison of a Rule-Based and a Machine-Learning-Based Approach. By Julia Richter, Christian Wiede, Ulrich Heinkel, Gangolf Hirtz (VISIGRAPP (4: VISAPP), 2019)']

OK, this seems to go everywhere. Maybe we can narrow things down with a more specific query.

[13]:
gismo.rank("Machine Learning and covid-19")
[13]:
True
[14]:
gismo.get_documents_by_rank()
[14]:
['Ergonomics of Virtual Learning During COVID-19. By Lu Yuan, Alison Garaudy (AHFE (11), 2021)',
 'University Virtual Learning in Covid Times. By Verónica Marín-Díaz, Eloísa Reche, Javier Martín (Technol. Knowl. Learn., 2022)',
 'DCML: Deep contrastive mutual learning for COVID-19 recognition. By Hongbin Zhang, Weinan Liang, Chuanxiu Li, Qipeng Xiong, Haowei Shi, Lang Hu, Guangli Li (Biomed. Signal Process. Control., 2022)',
 'Interpretable Sequence Learning for Covid-19 Forecasting. By Sercan Ömer Arik, Chun-Liang Li, Jinsung Yoon, Rajarishi Sinha, Arkady Epshteyn, Long T. Le, Vikas Menon, Shashank Singh 0005, Leyou Zhang, Martin Nikoltchev, Yash Sonthalia, Hootan Nakhost, Elli Kanal, Tomas Pfister (NeurIPS, 2020)',
 'Interpretable Sequence Learning for COVID-19 Forecasting. By Sercan Ömer Arik, Chun-Liang Li, Jinsung Yoon, Rajarishi Sinha, Arkady Epshteyn, Long T. Le, Vikas Menon, Shashank Singh 0005, Leyou Zhang, Nate Yoder, Martin Nikoltchev, Yash Sonthalia, Hootan Nakhost, Elli Kanal, Tomas Pfister (CoRR, 2020)',
 'The Deaf Experience in Remote Learning during COVID-19. By Yosra Bouzid, Mohamed Jemni (ICTA, 2021)',
 'The Study on the Efficiency of Smart Learning in the COVID-19. By Seong-Kyu Kim, Mi-Jung Lee, Eun-Sill Jang, Young-Eun Lee (J. Multim. Inf. Syst., 2022)',
 'Dual Teaching: Simultaneous Remote and In-Person Learning During COVID. By Hunter M. Williams, Malcolm Haynes, Joseph Kim (SIGITE, 2021)',
 'An Analysis of the Effectiveness of Emergency Distance Learning under COVID-19. By Ngo Tung Son, Bui Ngoc Anh, Kieu Quoc Tuan, Son Ba Nguyen, Son Hoang Nguyen, Jafreezal Jaafar (CCRIS, 2020)',
 'Automated Machine Learning for COVID-19 Forecasting. By Jaco Tetteroo, Mitra Baratchi, Holger H. Hoos (IEEE Access, 2022)',
 "Unsupervised Convolutional Filter Learning for COVID-19 Classification. By Sakthi Ganesh Mahalingam, Saichandra Pandraju (Rev. d'Intelligence Artif., 2021)",
 'A Data Augmented Approach to Transfer Learning for Covid-19 Detection. By Shagufta Henna, Aparna Reji (CoRR, 2021)',
 'Academic Procrastination and Online Learning During the COVID-19 Pandemic. By Jørgen Melgaard, Rubina Monir, Lester Allan Lasrado, Asle Fagerstrøm (CENTERIS/ProjMAN/HCist, 2021)',
 'A comprehensive review of federated learning for COVID-19 detection. By Sadaf Naz, Khoa Tran Phan, Yi-Ping Phoebe Chen (Int. J. Intell. Syst., 2022)',
 'M-learning in the COVID-19 era: physical vs digital class. By Vasiliki Matzavela, Efthimios Alepis (Educ. Inf. Technol., 2021)',
 'Challenges of Online Learning During the COVID-19: What Can We Learn on Twitter? By Wei Quan (ARTIIS, 2021)',
 'Exploring the Effectiveness of Fully Online Translation Learning During COVID-19. By Wenchao Su, Defeng Li, Xiaoxing Zhao, Ruilin Li (ICWL/SETE, 2020)',
 'Educational Transformation: An Evaluation of Online Learning Due To COVID-19. By Rizky Firmansyah, Dhika Maha Putri, Mochammad Galih Satriyo Wicaksono, Sheila Febriani Putri, Ahmad Arif Widianto, Mohd Rizal Palil (Int. J. Emerg. Technol. Learn., 2021)',
 'Adoption, use and enhancement of virtual learning during COVID-19. By Munyaradzi Zhou, Canicio Dzingirai, Kudakwashe Hove, Tavengwa Chitata, Raymond Mugandani (Educ. Inf. Technol., 2022)',
 'Online Learning Before, During and After COVID-19: Observations Over 20 Years. By Natalie Wieland, Liz Kollias (Int. J. Adv. Corp. Learn., 2020)',
 'Attitudes Towards Online Learning During COVID-19: A Cluster and Sentiment Analyses. By Rex Perez Bringula, Ma. Teresa Borebor, Ma. Ymelda C. Batalla (MLIS, 2022)',
 'The Effects of the Sudden Switch to Remote Learning Due to Covid-19 on HBCU Students and Faculty. By Mariele Ponticiello, Mariah Simmons, Joon-Suk Lee (HCI (23), 2021)',
 'Dynamic-Fusion-Based Federated Learning for COVID-19 Detection. By Weishan Zhang, Tao Zhou, Qinghua Lu 0001, Xiao Wang 0002, Chunsheng Zhu, Haoyun Sun, Zhipeng Wang, Sin Kit Lo, Fei-Yue Wang 0001 (IEEE Internet Things J., 2021)',
 'Dynamic Fusion based Federated Learning for COVID-19 Detection. By Weishan Zhang, Tao Zhou, Qinghua Lu 0001, Xiao Wang 0002, Chunsheng Zhu, Haoyun Sun, Zhipeng Wang, Sin Kit Lo, Feiyue Wang 0001 (CoRR, 2020)',
 'Using Mobile ICT for Online Learning During COVID-19 Lockdown. By Viktoriia Tkachuk, Yuliia V. Yechkalo, Serhiy Semerikov, Maria Kislova, Yana Hladyr (ICTERI (Revised Selected Papers), 2020)',
 "Analysis of Teachers' Satisfaction With Online Learning During the Covid-19 Pandemic. By Yaser Saleh, Nuha El-Khalili, Nesreen A. Otoum, Mohammad Al-Sheikh Hasan, Saif Abu-Aishah, Izzeddin Matar (IWSSIP, 2022)",
 "Teachers' emotions, technostress, and burnout in distance learning during the COVID-19 pandemic. By Francesco Sulla, Benedetta Ragni, Miriana D'Angelo, Dolores Rollo (teleXbe, 2022)",
 "Teachers' Difficulties in Implementing Distance Learning during Covid-19 Pandemic. By Nana Diana, Suhendra Suhendra, Yohannes Yohannes (ICETC, 2020)",
 'Student Emotions in the Shift to Online Learning During the COVID-19 Pandemic. By Danielle P. Espino, Tiffany Wright, Victoria M. Brown, Zachariah Mbasu, Matthew Sweeney, Seung B. Lee (ICQE, 2020)',
 'CNN-based Transfer Learning for Covid-19 Diagnosis. By Zanear Sh. Ahmed, Nigar M. Shafiq Surameery, Rasber Dh. Rashid, Shadman Q. Salih, Hawre Kh. Abdulla (ICIT, 2021)',
 'Federated learning for COVID-19 screening from Chest X-ray images. By Ines Feki, Sourour Ammar, Yousri Kessentini, Khan Muhammad (Appl. Soft Comput., 2021)',
 'University-Wide Online Learning During COVID-19: From Policy to Practice. By Nuengwong Tuaycharoen (Int. J. Interact. Mob. Technol., 2021)',
 'Mobile Technology for Learning During Covid-19: Opportunities, Lessons, and Challenges. By Oluwakemi Fasae, Femi Alufa, Victor Ayodele, Akachukwu Okoli, Opeyemi Dele-Ajayi (IMCL, 2021)',
 'Optimal policy learning for COVID-19 prevention using reinforcement learning. By Muhammad Irfan Uddin, Syed Atif Ali Shah, Mahmoud Ahmad Al-Khasawneh, Ala Abdulsalam Alarood, Eesa Alsolami (J. Inf. Sci., 2022)',
 'Semi-supervised Learning for COVID-19 Image Classification via ResNet. By Lucy Nwuso, Xiangfang Li, Lijun Qian, Seungchan Kim, Xishuang Dong (CoRR, 2021)',
 'Using Deep Learning for COVID-19 Control: Implementing a Convolutional Neural Network in a Facemask Detection Application. By Caolan Deery, Kevin Meehan (SmartNets, 2021)',
 'Boosting Deep Transfer Learning For Covid-19 Classification. By Fouzia Altaf, Syed M. S. Islam, Naeem Khalid Janjua, Naveed Akhtar (ICIP, 2021)',
 'Boosting Deep Transfer Learning for COVID-19 Classification. By Fouzia Altaf, Syed M. S. Islam, Naeem Khalid Janjua, Naveed Akhtar (CoRR, 2021)',
 'Applications of machine learning for COVID-19 misinformation: a systematic review. By A. R. Sanaullah, Anupam Das 0006, Anik Das, Muhammad Ashad Kabir, Kai Shu (Soc. Netw. Anal. Min., 2022)',
 'A Survey on Deep Learning and Machine Learning for COVID-19 Detection. By Mohamed M. Dessouky, Sahar F. Sabbeh, Boushra Alshehri (ICFNDS, 2021)',
 "Evaluating Students' Aprehension About Remote Learning During the COVID-19 Pandemic: a Brazilian Sample. By Wesley Machado, Cassia Isac, Thiago Franco Leal, Luiz Couto, David Silva (LWMOOCS, 2020)"]

Sounds nice. How are the top-10 articles related? Note: as the graph structure is really sparse on the document side (about 10 features per document), it is best to deactivate the query distortion, which is intended for longer documents.

[15]:
gismo.parameters.distortion = 0.0
gismo.get_documents_by_cluster(k=10)
 F: 0.45. R: 0.01. S: 0.74.
- F: 0.45. R: 0.01. S: 0.73.
-- F: 0.45. R: 0.00. S: 0.66.
--- F: 0.66. R: 0.00. S: 0.53.
---- Ergonomics of Virtual Learning During COVID-19. (R: 0.00; S: 0.63)
---- University Virtual Learning in Covid Times. (R: 0.00; S: 0.35)
--- DCML: Deep contrastive mutual learning for COVID-19 recognition. (R: 0.00; S: 0.60)
--- Dual Teaching: Simultaneous Remote and In-Person Learning During COVID. (R: 0.00; S: 0.35)
--- An Analysis of the Effectiveness of Emergency Distance Learning under COVID-19. (R: 0.00; S: 0.56)
-- F: 0.70. R: 0.00. S: 0.60.
--- F: 1.00. R: 0.00. S: 0.54.
---- Interpretable Sequence Learning for Covid-19 Forecasting. (R: 0.00; S: 0.54)
---- Interpretable Sequence Learning for COVID-19 Forecasting. (R: 0.00; S: 0.54)
--- Automated Machine Learning for COVID-19 Forecasting. (R: 0.00; S: 0.59)
-- The Deaf Experience in Remote Learning during COVID-19. (R: 0.00; S: 0.52)
- The Study on the Efficiency of Smart Learning in the COVID-19. (R: 0.00; S: 0.50)

Now, let’s look at the main keywords.

[16]:
gismo.get_features_by_rank(20)
[16]:
['covid',
 '19',
 'covid 19',
 'learning covid',
 'machine',
 'machine learning',
 'pandemic',
 '19 pandemic',
 'online learning',
 'online',
 'chest',
 'deep',
 '19 detection',
 'deep learning',
 'student',
 'distance learning',
 'ray',
 'chest ray',
 'classification',
 'ct']

Let’s organize them.

[17]:
# On the feature side, the graph is more dense so we can use query distortion
gismo.get_features_by_cluster(distortion=1)
 F: 0.31. R: 0.06. S: 0.97.
- F: 0.46. R: 0.06. S: 0.97.
-- F: 0.55. R: 0.05. S: 0.97.
--- F: 0.96. R: 0.05. S: 0.97.
---- covid (R: 0.01; S: 1.00)
---- 19 (R: 0.01; S: 0.99)
---- covid 19 (R: 0.01; S: 0.99)
---- learning covid (R: 0.01; S: 0.97)
--- F: 0.97. R: 0.01. S: 0.55.
---- pandemic (R: 0.00; S: 0.56)
---- 19 pandemic (R: 0.00; S: 0.54)
-- F: 0.95. R: 0.00. S: 0.44.
--- online learning (R: 0.00; S: 0.44)
--- online (R: 0.00; S: 0.45)
- F: 1.00. R: 0.01. S: 0.31.
-- machine (R: 0.00; S: 0.31)
-- machine learning (R: 0.00; S: 0.31)

Rough, very broad analysis:
- One big keyword cluster about Coronavirus / Covid-19, pandemic, and online learning;
- Machine Learning as a separate, small cluster.

[18]:
np.dot(gismo.embedding.query_projection("Machine learning")[0], gismo.embedding.y)
[18]:
<1x6232511 sparse matrix of type '<class 'numpy.float64'>'
        with 88256 stored elements in Compressed Sparse Row format>

About 88,000 articles have an explicit link to machine learning.

[19]:
np.dot(gismo.embedding.query_projection("Covid-19")[0], gismo.embedding.y)
[19]:
<1x6232511 sparse matrix of type '<class 'numpy.float64'>'
        with 11831 stored elements in Compressed Sparse Row format>

About 12,000 articles have an explicit link to Covid-19.
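Both counts are simply the stored elements of a sparse dot product. A toy sketch of the same pattern (small, made-up matrices, not the DBLP embedding):

```python
import numpy as np
from scipy.sparse import csr_matrix

# 1 x 3 query vector over features, 3 x 4 feature-document matrix.
query = csr_matrix(np.array([[1.0, 0.0, 1.0]]))
y = csr_matrix(np.array([[1, 0, 0, 1],
                         [0, 1, 0, 0],
                         [0, 0, 1, 1]], dtype=float))
hits = query.dot(y)
# Stored elements = documents sharing at least one query feature.
print(hits.nnz)  # -> 3
```

Documents 0, 2, and 3 each share a feature with the query, so the product row has three stored elements.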

Authors query

Instead of looking at words, we can explore authors and their collaborations.

We just have to rewire the corpus to output strings of authors.

[20]:
def to_authors_text(dic):
    return " ".join([a.replace(' ', '_') for a in dic['authors']])
corpus = Corpus(source, to_text=to_authors_text)

We can build a new embedding on top of this modified corpus. We tell the vectorizer to be stupid: don’t preprocess; words are separated by spaces.
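Since tokens are space-separated, spaces inside a name are replaced by underscores so that each author maps to a single feature. A quick check of to_authors_text on a made-up record:

```python
def to_authors_text(dic):
    return " ".join([a.replace(' ', '_') for a in dic['authors']])

# Hypothetical record, just to illustrate the tokenization.
art = {'authors': ['Laurent Massoulié', 'Marc Lelarge']}
txt = to_authors_text(art)
print(txt)             # Laurent_Massoulié Marc_Lelarge
print(txt.split(' '))  # ['Laurent_Massoulié', 'Marc_Lelarge']
```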

This will take a few minutes (you can save the embedding for later if you want).

[21]:
vectorizer = CountVectorizer(dtype=float,
                             preprocessor=lambda x: x,
                             tokenizer=lambda x: x.split(' '))
try:
    a_embedding = Embedding.load(filename="dblp_aut_embedding", path=path)
except Exception:
    a_embedding = Embedding(vectorizer=vectorizer)
    a_embedding.fit_transform(corpus)
    a_embedding.dump(filename="dblp_aut_embedding", path=path)
[22]:
a_embedding.x
[22]:
<6232511x3200857 sparse matrix of type '<class 'numpy.float64'>'
        with 20237296 stored elements in Compressed Sparse Row format>

We now have about 3,200,000 authors to explore. Let’s rebuild the gismo object and start playing.

[23]:
gismo = Gismo(corpus, a_embedding)
gismo.post_documents_item = post_article
gismo.post_features_item = lambda g, i: g.embedding.features[i].replace("_", " ")
[24]:
gismo.post_documents_cluster = partial(post_documents_cluster_print, post_item=post_meta)
gismo.post_features_cluster = post_features_cluster_print

Laurent Massoulié query

[25]:
gismo.rank("Laurent_Massoulié")
[25]:
True

What are the most central articles of Laurent Massoulié in terms of collaboration?

[26]:
gismo.get_documents_by_rank(k=10)
[26]:
['Asynchrony and Acceleration in Gossip Algorithms. By Mathieu Even, Hadrien Hendrikx, Laurent Massoulié (CoRR, 2020)',
 'Decentralized Optimization with Heterogeneous Delays: a Continuous-Time Approach. By Mathieu Even, Hadrien Hendrikx, Laurent Massoulié (CoRR, 2021)',
 'Scalable Local Area Service Discovery. By Richard Black, Heimir Sverrisson, Laurent Massoulié (ICC, 2007)',
 'A spectral method for community detection in moderately-sparse degree-corrected stochastic block models. By Lennart Gulikers, Marc Lelarge, Laurent Massoulié (CoRR, 2015)',
 'Non-Backtracking Spectrum of Degree-Corrected Stochastic Block Models. By Lennart Gulikers, Marc Lelarge, Laurent Massoulié (CoRR, 2016)',
 'Non-Backtracking Spectrum of Degree-Corrected Stochastic Block Models. By Lennart Gulikers, Marc Lelarge, Laurent Massoulié (ITCS, 2017)',
 'An Impossibility Result for Reconstruction in a Degree-Corrected Planted-Partition Model. By Lennart Gulikers, Marc Lelarge, Laurent Massoulié (CoRR, 2015)',
 'From tree matching to sparse graph alignment. By Luca Ganassali, Laurent Massoulié (CoRR, 2020)',
 'From tree matching to sparse graph alignment. By Luca Ganassali, Laurent Massoulié (COLT, 2020)',
 'Optimal Content Placement for Peer-to-Peer Video-on-Demand Systems. By Bo Tan 0002, Laurent Massoulié (IEEE/ACM Trans. Netw., 2013)']

We see lots of duplicates. This is not surprising, as many articles are published first as a research report, then as a conference paper, and finally as a journal article. Luckily, Gismo's coverage feature can handle this for you.
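
The intuition behind coverage can be sketched with a toy greedy selection (this is NOT Gismo's actual algorithm, just an illustration with hypothetical data): a document is only kept if it brings at least one feature not already covered, so near-duplicates sharing the same author set get skipped.

```python
def greedy_cover(docs, k):
    """docs: list of (title, set_of_features). Returns up to k titles."""
    covered, picked = set(), []
    # Consider larger feature sets first.
    for title, feats in sorted(docs, key=lambda d: -len(d[1])):
        if len(picked) == k:
            break
        if feats - covered:  # brings at least one new feature
            picked.append(title)
            covered |= feats
    return picked

docs = [
    ("Tree matching (CoRR)", {"Ganassali", "Massoulié"}),
    ("Tree matching (COLT)", {"Ganassali", "Massoulié"}),  # same authors: skipped
    ("Gossip algorithms", {"Even", "Hendrikx", "Massoulié"}),
]
print(greedy_cover(docs, k=2))  # ['Gossip algorithms', 'Tree matching (CoRR)']
```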

[27]:
gismo.get_documents_by_coverage(k=10)
[27]:
['Asynchrony and Acceleration in Gossip Algorithms. By Mathieu Even, Hadrien Hendrikx, Laurent Massoulié (CoRR, 2020)',
 'Decentralized Optimization with Heterogeneous Delays: a Continuous-Time Approach. By Mathieu Even, Hadrien Hendrikx, Laurent Massoulié (CoRR, 2021)',
 'Scalable Local Area Service Discovery. By Richard Black, Heimir Sverrisson, Laurent Massoulié (ICC, 2007)',
 'A spectral method for community detection in moderately-sparse degree-corrected stochastic block models. By Lennart Gulikers, Marc Lelarge, Laurent Massoulié (CoRR, 2015)',
 'Non-Backtracking Spectrum of Degree-Corrected Stochastic Block Models. By Lennart Gulikers, Marc Lelarge, Laurent Massoulié (CoRR, 2016)',
 'Non-Backtracking Spectrum of Degree-Corrected Stochastic Block Models. By Lennart Gulikers, Marc Lelarge, Laurent Massoulié (ITCS, 2017)',
 'An Impossibility Result for Reconstruction in a Degree-Corrected Planted-Partition Model. By Lennart Gulikers, Marc Lelarge, Laurent Massoulié (CoRR, 2015)',
 'From tree matching to sparse graph alignment. By Luca Ganassali, Laurent Massoulié (CoRR, 2020)',
 'From tree matching to sparse graph alignment. By Luca Ganassali, Laurent Massoulié (COLT, 2020)',
 'Optimal Content Placement for Peer-to-Peer Video-on-Demand Systems. By Bo Tan 0002, Laurent Massoulié (IEEE/ACM Trans. Netw., 2013)']

Hmm, that did not work well. The culprit here is query distortion, a Gismo feature that modulates the clustering with the query. Sadly, when features are authors, the underlying graph has a very specific structure (highly sparse and redundant) that makes query distortion too aggressive. The solution is to deactivate it.

[28]:
gismo.parameters.distortion = 0
gismo.get_documents_by_coverage(k=10)
[28]:
['Asynchrony and Acceleration in Gossip Algorithms. By Mathieu Even, Hadrien Hendrikx, Laurent Massoulié (CoRR, 2020)',
 'Scalable Local Area Service Discovery. By Richard Black, Heimir Sverrisson, Laurent Massoulié (ICC, 2007)',
 'A spectral method for community detection in moderately-sparse degree-corrected stochastic block models. By Lennart Gulikers, Marc Lelarge, Laurent Massoulié (CoRR, 2015)',
 'From tree matching to sparse graph alignment. By Luca Ganassali, Laurent Massoulié (CoRR, 2020)',
 'Optimal Content Placement for Peer-to-Peer Video-on-Demand Systems. By Bo Tan 0002, Laurent Massoulié (IEEE/ACM Trans. Netw., 2013)',
 'Robustness of Spectral Methods for Community Detection. By Ludovic Stephan, Laurent Massoulié (COLT, 2019)',
 'Concentration of Non-Isotropic Random Tensors with Applications to Learning and Empirical Risk Minimization. By Mathieu Even, Laurent Massoulié (CoRR, 2021)',
 'Correlation Detection in Trees for Planted Graph Alignment. By Luca Ganassali, Laurent Massoulié, Marc Lelarge (ITCS, 2022)',
 'Non-Backtracking Spectrum of Degree-Corrected Stochastic Block Models. By Lennart Gulikers, Marc Lelarge, Laurent Massoulié (CoRR, 2016)',
 'Non-Backtracking Spectrum of Degree-Corrected Stochastic Block Models. By Lennart Gulikers, Marc Lelarge, Laurent Massoulié (ITCS, 2017)']

Much better: no duplicates and more diversity in the results. Let’s observe the communities.

[29]:
gismo.get_documents_by_cluster(k=20, resolution=.9)
 F: 0.37. R: 0.07. S: 0.86.
- F: 0.37. R: 0.07. S: 0.86.
-- F: 0.81. R: 0.01. S: 0.63.
--- F: 1.00. R: 0.01. S: 0.56.
---- Mathieu Even, Hadrien Hendrikx, Laurent Massoulié (CoRR, 2020) (R: 0.00; S: 0.56)
---- Mathieu Even, Hadrien Hendrikx, Laurent Massoulié (CoRR, 2021) (R: 0.00; S: 0.56)
--- F: 1.00. R: 0.01. S: 0.63.
---- Mathieu Even, Laurent Massoulié (CoRR, 2021) (R: 0.00; S: 0.63)
---- Mathieu Even, Laurent Massoulié (COLT, 2021) (R: 0.00; S: 0.63)
-- F: 0.37. R: 0.06. S: 0.81.
--- F: 0.50. R: 0.04. S: 0.71.
---- F: 1.00. R: 0.01. S: 0.59.
----- Lennart Gulikers, Marc Lelarge, Laurent Massoulié (CoRR, 2015) (R: 0.00; S: 0.59)
----- Lennart Gulikers, Marc Lelarge, Laurent Massoulié (CoRR, 2016) (R: 0.00; S: 0.59)
----- Lennart Gulikers, Marc Lelarge, Laurent Massoulié (ITCS, 2017) (R: 0.00; S: 0.59)
----- Lennart Gulikers, Marc Lelarge, Laurent Massoulié (CoRR, 2015) (R: 0.00; S: 0.59)
---- F: 0.84. R: 0.03. S: 0.64.
----- F: 1.00. R: 0.01. S: 0.64.
------ Luca Ganassali, Laurent Massoulié (CoRR, 2020) (R: 0.00; S: 0.64)
------ Luca Ganassali, Laurent Massoulié (COLT, 2020) (R: 0.00; S: 0.64)
----- F: 1.00. R: 0.02. S: 0.60.
------ Luca Ganassali, Laurent Massoulié, Marc Lelarge (ITCS, 2022) (R: 0.00; S: 0.60)
------ Luca Ganassali, Laurent Massoulié, Marc Lelarge (CoRR, 2021) (R: 0.00; S: 0.60)
------ Luca Ganassali, Marc Lelarge, Laurent Massoulié (CoRR, 2019) (R: 0.00; S: 0.60)
------ Luca Ganassali, Laurent Massoulié, Marc Lelarge (COLT, 2021) (R: 0.00; S: 0.60)
------ Luca Ganassali, Laurent Massoulié, Marc Lelarge (CoRR, 2021) (R: 0.00; S: 0.60)
--- F: 0.37. R: 0.01. S: 0.70.
---- F: 1.00. R: 0.01. S: 0.61.
----- Bo Tan 0002, Laurent Massoulié (IEEE/ACM Trans. Netw., 2013) (R: 0.00; S: 0.61)
----- Bo Tan 0002, Laurent Massoulié (INFOCOM, 2011) (R: 0.00; S: 0.61)
----- Bo Tan 0002, Laurent Massoulié (PODC, 2010) (R: 0.00; S: 0.61)
---- Ludovic Stephan, Laurent Massoulié (COLT, 2019) (R: 0.00; S: 0.61)
- Richard Black, Heimir Sverrisson, Laurent Massoulié (ICC, 2007) (R: 0.00; S: 0.46)

OK! We see that the articles are organized by writing communities. Also note how Gismo built a hierarchical grouping of these communities.

Now, let’s look at things in terms of authors. This is actually the interesting part when studying collaborations.

[30]:
gismo.get_features_by_rank()
[30]:
['Laurent Massoulié',
 'Marc Lelarge',
 'Stratis Ioannidis',
 'Hadrien Hendrikx',
 'Nidhi Hegde 0001',
 'Peter B. Key',
 'Francis R. Bach',
 'Anne-Marie Kermarrec',
 'Ayalvadi J. Ganesh',
 'Luca Ganassali',
 'Mathieu Even',
 'Lennart Gulikers',
 'Milan Vojnovic',
 'Dan-Cristian Tomozei',
 'Amin Karbasi',
 'Augustin Chaintreau',
 'Mathieu Leconte',
 'Bo Tan 0002',
 'James Roberts',
 'Rémi Varloot',
 'Kuang Xu']

We see many authors that were not present in the articles listed above. This is an important observation: central articles (with respect to a query) are not necessarily written by central authors!
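
To see why this can happen, here is a toy example with hypothetical data: document relevance scores and an authorship matrix. Author scores aggregate over all of an author's papers, so a co-author of several moderately ranked papers can outrank the author of the single top paper.

```python
import numpy as np

doc_scores = np.array([0.9, 0.4, 0.4, 0.3])  # relevance of 4 toy documents
authors = ["A", "B", "C"]
authorship = np.array([[1, 0, 0, 0],   # A wrote only the top document
                       [0, 1, 1, 1],   # B wrote the three lesser ones
                       [0, 0, 0, 1]])  # C co-wrote one minor document
author_scores = authorship @ doc_scores  # aggregate over each author's papers
top_doc_author = "A"                     # author of document 0 (score 0.9)
top_author = authors[int(np.argmax(author_scores))]
print(top_doc_author, top_author)  # A B -- they differ
```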

Let’s organize them into communities.

[31]:
gismo.get_features_by_cluster(resolution=.6)
 F: 0.00. R: 0.21. S: 0.54.
- F: 0.00. R: 0.21. S: 0.54.
-- F: 0.01. R: 0.21. S: 0.54.
--- F: 0.01. R: 0.20. S: 0.54.
---- F: 0.01. R: 0.20. S: 0.55.
----- F: 0.02. R: 0.18. S: 0.54.
------ F: 0.03. R: 0.17. S: 0.52.
------- F: 0.08. R: 0.13. S: 0.42.
-------- F: 0.15. R: 0.12. S: 0.41.
--------- Laurent Massoulié (R: 0.10; S: 1.00)
--------- Marc Lelarge (R: 0.01; S: 0.17)
--------- Luca Ganassali (R: 0.01; S: 0.19)
-------- F: 0.12. R: 0.01. S: 0.22.
--------- Lennart Gulikers (R: 0.00; S: 0.22)
--------- Milan Vojnovic (R: 0.00; S: 0.06)
------- F: 0.10. R: 0.02. S: 0.27.
-------- F: 0.35. R: 0.01. S: 0.27.
--------- Hadrien Hendrikx (R: 0.01; S: 0.26)
--------- Mathieu Even (R: 0.00; S: 0.20)
-------- Francis R. Bach (R: 0.01; S: 0.07)
------- F: 0.08. R: 0.02. S: 0.18.
-------- F: 0.12. R: 0.01. S: 0.17.
--------- Peter B. Key (R: 0.01; S: 0.12)
--------- Ayalvadi J. Ganesh (R: 0.01; S: 0.14)
-------- Anne-Marie Kermarrec (R: 0.01; S: 0.07)
------ Nidhi Hegde 0001 (R: 0.01; S: 0.16)
------ Mathieu Leconte (R: 0.00; S: 0.09)
----- F: 0.02. R: 0.02. S: 0.11.
------ F: 0.07. R: 0.01. S: 0.11.
------- Stratis Ioannidis (R: 0.01; S: 0.11)
------- Augustin Chaintreau (R: 0.00; S: 0.05)
------ Amin Karbasi (R: 0.00; S: 0.04)
---- Dan-Cristian Tomozei (R: 0.00; S: 0.08)
---- Bo Tan 0002 (R: 0.00; S: 0.11)
--- Rémi Varloot (R: 0.00; S: 0.09)
-- James Roberts (R: 0.00; S: 0.05)
- Kuang Xu (R: 0.00; S: 0.03)

Jim Roberts query

[32]:
gismo.rank("James_W._Roberts")
[32]:
True

Let’s have a covering set of articles.

[33]:
gismo.get_documents_by_coverage(k=10)
[33]:
['Integrated Admission Control for Streaming and Elastic Traffic. By Nabil Benameur, Slim Ben Fredj, Frank Delcoigne, Sara Oueslati-Boulahia, James W. Roberts (QofIS, 2001)',
 'An In-Camera Data Stream Processing System for Defect Detection in Web Inspection Tasks. By S. Hossain Hajimowlana, Roberto Muscedere, Graham A. Jullien, James W. Roberts (Real Time Imaging, 1999)',
 "Modifications of Thomae's Function and Differentiability. By Kevin Beanland, James W. Roberts, Craig Stevenson (Am. Math. Mon., 2009)",
 'A Traffic Control Framework for High Speed Data Transmission. By James W. Roberts, Brahim Bensaou, Y. Canetti (Modelling and Evaluation of ATM Networks, 1993)',
 'Statistical bandwidth sharing: a study of congestion at flow level. By Slim Ben Fredj, Thomas Bonald, Alexandre Proutière, G. Régnié, James W. Roberts (SIGCOMM, 2001)',
 'Swing: Traffic capacity of a simple WDM ring network. By Thomas Bonald, Sara Oueslati, James W. Roberts, Charlotte Roger (ITC, 2009)',
 'Impact of "Trunk Reservation" on Elastic Flow Routing. By Sara Oueslati-Boulahia, James W. Roberts (NETWORKING, 2000)',
 'Comment on "Datacenter Congestion Control: Identifying what is essential and making it practical" by Aisha Mushtaq, et al, CCR, July 2019. By James W. Roberts (Comput. Commun. Rev., 2020)',
 'QoS Guarantees for Shaped Bit Rate Video Connections in Broadband Networks. By Maher Hamdi, James W. Roberts (MMNET, 1995)',
 'Multi-Resource Fairness: Objectives, Algorithms and Performance. By Thomas Bonald, James W. Roberts (SIGMETRICS, 2015)']

Who are the associated authors?

[34]:
gismo.get_features_by_rank(k=10)
[34]:
['James W. Roberts',
 'Thomas Bonald',
 'Maher Hamdi',
 'Sara Oueslati-Boulahia',
 'Ali Ibrahim',
 'Alexandre Proutière',
 'Sara Oueslati',
 'Jorma T. Virtamo',
 'Slim Ben Fredj',
 'Jussi Kangasharju']

Let’s organize them.

[35]:
gismo.get_features_by_cluster(k=10, resolution=.4)
 F: 0.01. R: 0.24. S: 0.51.
- F: 0.01. R: 0.23. S: 0.51.
-- F: 0.04. R: 0.21. S: 0.50.
--- F: 0.15. R: 0.20. S: 0.50.
---- F: 0.23. R: 0.18. S: 0.90.
----- James W. Roberts (R: 0.14; S: 1.00)
----- Thomas Bonald (R: 0.05; S: 0.25)
---- F: 0.57. R: 0.01. S: 0.32.
----- Sara Oueslati-Boulahia (R: 0.01; S: 0.27)
----- Slim Ben Fredj (R: 0.01; S: 0.30)
---- Sara Oueslati (R: 0.01; S: 0.15)
--- Alexandre Proutière (R: 0.01; S: 0.04)
-- Maher Hamdi (R: 0.01; S: 0.12)
-- Ali Ibrahim (R: 0.01; S: 0.06)
- F: 0.01. R: 0.01. S: 0.05.
-- Jorma T. Virtamo (R: 0.01; S: 0.04)
-- Jussi Kangasharju (R: 0.00; S: 0.03)

Combined queries

We can input multiple authors.

[36]:
gismo.rank("Laurent_Massoulié and James_W._Roberts")
[36]:
True

Let’s have a covering set of articles.

[37]:
gismo.get_documents_by_coverage(k=10)
[37]:
['Integrated Admission Control for Streaming and Elastic Traffic. By Nabil Benameur, Slim Ben Fredj, Frank Delcoigne, Sara Oueslati-Boulahia, James W. Roberts (QofIS, 2001)',
 'Defect detection in web inspection using fuzzy fusion of texture features. By S. Hossain Hajimowlana, Roberto Muscedere, Graham A. Jullien, James W. Roberts (ISCAS, 2000)',
 "Modifications of Thomae's Function and Differentiability. By Kevin Beanland, James W. Roberts, Craig Stevenson (Am. Math. Mon., 2009)",
 'A Traffic Control Framework for High Speed Data Transmission. By James W. Roberts, Brahim Bensaou, Y. Canetti (Modelling and Evaluation of ATM Networks, 1993)',
 'Statistical bandwidth sharing: a study of congestion at flow level. By Slim Ben Fredj, Thomas Bonald, Alexandre Proutière, G. Régnié, James W. Roberts (SIGCOMM, 2001)',
 'Swing: Traffic capacity of a simple WDM ring network. By Thomas Bonald, Sara Oueslati, James W. Roberts, Charlotte Roger (ITC, 2009)',
 'Impact of "Trunk Reservation" on Elastic Flow Routing. By Sara Oueslati-Boulahia, James W. Roberts (NETWORKING, 2000)',
 'Internet traffic, QoS, and pricing. By James W. Roberts (Proc. IEEE, 2004)',
 'QoS Guarantees for Shaped Bit Rate Video Connections in Broadband Networks. By Maher Hamdi, James W. Roberts (MMNET, 1995)',
 'Enhanced Cluster Computing Performance Through Proportional Fairness. By Thomas Bonald, James W. Roberts (CoRR, 2014)']

Note that we only get articles by Roberts here, yet the returned list has slightly changed.

Now, let’s look at the main authors.

[38]:
gismo.get_features_by_rank()
[38]:
['James W. Roberts',
 'Laurent Massoulié',
 'Thomas Bonald',
 'Marc Lelarge',
 'Maher Hamdi',
 'Nidhi Hegde 0001',
 'Stratis Ioannidis',
 'Alexandre Proutière',
 'Sara Oueslati-Boulahia',
 'Hadrien Hendrikx',
 'Ali Ibrahim',
 'Peter B. Key',
 'Francis R. Bach',
 'Jorma T. Virtamo',
 'Sara Oueslati']

We see a mix of both co-authors. How are they organized?

[39]:
gismo.get_features_by_cluster(resolution=.4)
 F: 0.01. R: 0.20. S: 0.55.
- F: 0.02. R: 0.19. S: 0.55.
-- F: 0.05. R: 0.11. S: 0.50.
--- F: 0.23. R: 0.10. S: 0.50.
---- James W. Roberts (R: 0.07; S: 0.92)
---- Thomas Bonald (R: 0.03; S: 0.24)
---- Sara Oueslati-Boulahia (R: 0.00; S: 0.25)
--- Sara Oueslati (R: 0.00; S: 0.14)
-- F: 0.14. R: 0.05. S: 0.21.
--- F: 0.24. R: 0.05. S: 0.21.
---- Laurent Massoulié (R: 0.05; S: 0.39)
---- Hadrien Hendrikx (R: 0.00; S: 0.10)
--- Francis R. Bach (R: 0.00; S: 0.03)
-- Marc Lelarge (R: 0.01; S: 0.07)
-- Maher Hamdi (R: 0.01; S: 0.11)
-- F: 0.10. R: 0.01. S: 0.09.
--- Nidhi Hegde 0001 (R: 0.01; S: 0.08)
--- Alexandre Proutière (R: 0.00; S: 0.04)
-- Stratis Ioannidis (R: 0.00; S: 0.04)
-- Ali Ibrahim (R: 0.00; S: 0.06)
-- Peter B. Key (R: 0.00; S: 0.05)
- Jorma T. Virtamo (R: 0.00; S: 0.04)

Cross-gismo

Gismo can combine two embeddings to create one hybrid gismo, called a cross-gismo (XGismo). This feature can be used to analyze authors with respect to the words they use (and vice versa).
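
Conceptually (this sketch does not reflect XGismo's actual internals), chaining an author-to-article matrix with an article-to-word matrix yields an author-to-word affinity matrix, which is what lets us query authors by keywords and keywords by authors:

```python
import numpy as np

rng = np.random.default_rng(0)
n_authors, n_articles, n_words = 3, 5, 4
author_article = rng.random((n_authors, n_articles))  # who wrote what
article_word = rng.random((n_articles, n_words))      # which words each article uses
author_word = author_article @ article_word           # author/keyword affinity
print(author_word.shape)  # (3, 4)
```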

[40]:
from gismo.gismo import XGismo
gismo = XGismo(x_embedding=a_embedding, y_embedding=embedding)
gismo.diteration.n_iter = 2 # to speed up computation a little

Note that XGismo does not use the underlying corpus, so we can now close the source (the source keeps the file dblp.data open).

[41]:
source.close()
[42]:
gismo.post_documents_item = lambda g, i: g.corpus[i].replace("_", " ")
gismo.post_features_cluster = post_features_cluster_print
gismo.post_documents_cluster = post_documents_cluster_print

Let’s try a request.

[43]:
gismo.rank("self-stabilization")
[43]:
True

What are the associated keywords?

[44]:
gismo.get_features_by_rank(k=10)
[44]:
['stabilization',
 'self',
 'self stabilization',
 'stabilize',
 'self stabilize',
 'distribute',
 'distributed',
 'robust',
 'sensor',
 'stabilizing']

How are keywords structured?

[45]:
gismo.get_features_by_cluster(k=20, resolution=.8)
 F: 0.35. R: 0.02. S: 0.80.
- F: 0.61. R: 0.02. S: 0.80.
-- F: 0.76. R: 0.02. S: 0.81.
--- F: 0.82. R: 0.01. S: 0.81.
---- F: 0.92. R: 0.01. S: 0.81.
----- stabilization (R: 0.00; S: 0.81)
----- self stabilization (R: 0.00; S: 0.81)
----- stabilizing (R: 0.00; S: 0.75)
---- F: 0.94. R: 0.00. S: 0.68.
----- sensor (R: 0.00; S: 0.68)
----- wireless (R: 0.00; S: 0.66)
---- fault (R: 0.00; S: 0.76)
--- F: 0.85. R: 0.01. S: 0.67.
---- F: 0.97. R: 0.01. S: 0.67.
----- self (R: 0.00; S: 0.76)
----- stabilize (R: 0.00; S: 0.69)
----- self stabilize (R: 0.00; S: 0.66)
---- distributed (R: 0.00; S: 0.70)
-- F: 0.79. R: 0.00. S: 0.46.
--- distribute (R: 0.00; S: 0.67)
--- robot (R: 0.00; S: 0.41)
--- byzantine (R: 0.00; S: 0.44)
--- optimal (R: 0.00; S: 0.64)
--- mobile (R: 0.00; S: 0.53)
- F: 0.54. R: 0.00. S: 0.36.
-- F: 0.65. R: 0.00. S: 0.44.
--- F: 0.68. R: 0.00. S: 0.38.
---- robust (R: 0.00; S: 0.44)
---- stability (R: 0.00; S: 0.32)
--- adaptive (R: 0.00; S: 0.51)
-- F: 0.54. R: 0.00. S: 0.15.
--- nonlinear (R: 0.00; S: 0.08)
--- linear (R: 0.00; S: 0.25)

Who are the associated researchers?

[46]:
gismo.get_documents_by_rank(k=10)
[46]:
['Ted Herman',
 'Shlomi Dolev',
 'Sébastien Tixeuil',
 'Sukumar Ghosh',
 'George Varghese',
 'Shay Kutten',
 'Toshimitsu Masuzawa',
 'Stefan Schmid 0001',
 'Swan Dubois',
 'Laurence Pilard']

How are they structured?

[47]:
gismo.get_documents_by_cluster(k=10, resolution=.9)
 F: 0.78. R: 0.05. S: 0.83.
- F: 0.93. R: 0.05. S: 0.83.
-- F: 0.98. R: 0.01. S: 0.82.
--- Ted Herman (R: 0.01; S: 0.82)
--- George Varghese (R: 0.00; S: 0.81)
-- F: 0.96. R: 0.02. S: 0.80.
--- F: 0.97. R: 0.01. S: 0.80.
---- Shlomi Dolev (R: 0.01; S: 0.78)
---- Toshimitsu Masuzawa (R: 0.00; S: 0.74)
---- Laurence Pilard (R: 0.00; S: 0.80)
--- Shay Kutten (R: 0.00; S: 0.82)
-- F: 0.94. R: 0.02. S: 0.82.
--- F: 0.98. R: 0.01. S: 0.81.
---- Sébastien Tixeuil (R: 0.01; S: 0.82)
---- Sukumar Ghosh (R: 0.01; S: 0.80)
--- Swan Dubois (R: 0.00; S: 0.81)
- Stefan Schmid 0001 (R: 0.00; S: 0.67)

We can also query researchers. Just use underscores in the query and add y=False to indicate that the input is documents.

[48]:
gismo.rank("Sébastien_Tixeuil and Fabien_Mathieu", y=False)
[48]:
True

What are the associated keywords?

[49]:
gismo.get_features_by_rank(k=10)
[49]:
['p2p',
 'grid',
 'byzantine',
 'stabilization',
 'reloaded',
 'refresh',
 'self',
 'self stabilization',
 'p2p network',
 'live streaming']

Using coverage can yield other keywords of interest.

[50]:
gismo.get_features_by_coverage(k=10)
[50]:
['p2p',
 'grid',
 'byzantine',
 'preference',
 'pagerank',
 'fun',
 'reloaded',
 'p2p network',
 'old',
 'acyclic']

How are keywords structured?

[51]:
gismo.get_features_by_cluster(k=20, resolution=.7)
 F: 0.39. R: 0.21. S: 0.67.
- F: 0.72. R: 0.09. S: 0.31.
-- F: 0.72. R: 0.06. S: 0.31.
--- p2p (R: 0.02; S: 0.34)
--- refresh (R: 0.01; S: 0.27)
--- live streaming (R: 0.01; S: 0.29)
--- live (R: 0.01; S: 0.33)
--- streaming (R: 0.01; S: 0.32)
-- reloaded (R: 0.01; S: 0.29)
-- p2p network (R: 0.01; S: 0.29)
-- old (R: 0.01; S: 0.27)
-- acyclic (R: 0.01; S: 0.32)
- grid (R: 0.02; S: 0.47)
- F: 0.66. R: 0.09. S: 0.62.
-- F: 0.79. R: 0.08. S: 0.62.
--- byzantine (R: 0.01; S: 0.56)
--- stabilization (R: 0.01; S: 0.58)
--- self (R: 0.01; S: 0.62)
--- self stabilization (R: 0.01; S: 0.59)
--- stabilize (R: 0.01; S: 0.60)
--- self stabilize (R: 0.01; S: 0.60)
--- asynchronous (R: 0.01; S: 0.56)
-- fun (R: 0.01; S: 0.48)
- preference (R: 0.01; S: 0.24)
- pagerank (R: 0.01; S: 0.24)

Who are the associated researchers?

[52]:
gismo.get_documents_by_rank(k=10)
[52]:
['Sébastien Tixeuil',
 'Fabien Mathieu',
 'Shlomi Dolev',
 'Toshimitsu Masuzawa',
 'Michel Raynal',
 'Nitin H. Vaidya',
 'Stéphane Devismes',
 'Fukuhito Ooshita',
 'Edmond Bianco',
 'Ted Herman']

How are they structured?

[53]:
gismo.get_documents_by_cluster(k=10, resolution=.8)
 F: 0.00. R: 0.01. S: 0.66.
- F: 0.12. R: 0.00. S: 0.66.
-- F: 0.36. R: 0.00. S: 0.52.
--- F: 0.80. R: 0.00. S: 0.50.
---- F: 0.80. R: 0.00. S: 0.52.
----- Sébastien Tixeuil (R: 0.00; S: 0.53)
----- Shlomi Dolev (R: 0.00; S: 0.42)
----- Toshimitsu Masuzawa (R: 0.00; S: 0.47)
----- Stéphane Devismes (R: 0.00; S: 0.41)
----- Fukuhito Ooshita (R: 0.00; S: 0.50)
---- Ted Herman (R: 0.00; S: 0.39)
--- F: 0.82. R: 0.00. S: 0.31.
---- Michel Raynal (R: 0.00; S: 0.37)
---- Nitin H. Vaidya (R: 0.00; S: 0.26)
-- Fabien Mathieu (R: 0.00; S: 0.73)
- Edmond Bianco (R: 0.00; S: 0.04)