ACM categories

This tutorial shows how ACM categories can be studied with Gismo.

If you have never used Gismo before, you may want to start with the Toy example tutorial.

Imagine that you want to submit an article and are asked to provide an ACM category and some generic keywords. Let's see how Gismo can help you.

Here, documents are ACM categories. The features of a category will be the words of its name along with the words of the name of its descendants.

Initialisation

First, we load the required package.

[1]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

from gismo.datasets.acm import get_acm, flatten_acm
from gismo.corpus import Corpus
from gismo.embedding import Embedding
from gismo.gismo import Gismo
from gismo.post_processing import post_features_cluster_print, post_documents_cluster_print

Then, we load the ACM source. Note that we flatten the source, i.e. the existing hierarchy is discarded, as Gismo will provide its own dynamic, query-based structure.

[2]:
acm = flatten_acm(get_acm())

Each category in the acm list is a dict with name and query. We build a corpus that will tell Gismo that the content of a category is its query value.

[3]:
corpus = Corpus(acm, to_text=lambda x: x['query'])

We build an embedding on top of that corpus:
- We set min_df=3 to exclude rare features;
- We set ngram_range=(1, 3) to include bi-grams and tri-grams in the embedding;
- We manually pick a few common words to exclude from the embedding.

[4]:
vectorizer = CountVectorizer(min_df=3, ngram_range=(1, 3), dtype=float, stop_words=['to', 'and', 'by'])
embedding = Embedding(vectorizer=vectorizer)
embedding.fit_transform(corpus)
[5]:
embedding.x
[5]:
<234x6929 sparse matrix of type '<class 'numpy.float64'>'
        with 28014 stored elements in Compressed Sparse Row format>

We see from embedding.x that the embedding links 234 documents to 6,929 features. There are 28,014 weights: on average, each document is linked to about 120 features, and each feature is linked to about 4 documents.
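These averages can be read directly off any scipy sparse matrix. A minimal sketch on a toy matrix (embedding.x exposes the same interface):

```python
import numpy as np
from scipy.sparse import csr_matrix

# Toy document-feature matrix (3 documents x 4 features); in the tutorial,
# embedding.x plays this role, with shape (234, 6929) and 28014 weights.
x = csr_matrix(np.array([
    [1.0, 0.0, 2.0, 0.0],
    [0.0, 3.0, 0.0, 1.0],
    [1.0, 1.0, 0.0, 0.0],
]))

n_documents, n_features = x.shape
print(x.nnz / n_documents)  # average features per document: 2.0
print(x.nnz / n_features)   # average documents per feature: 1.5
```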

Now, we initiate the gismo object and customize its post-processors to ease the display.

[6]:
gismo = Gismo(corpus, embedding)
gismo.post_documents_item = lambda g, i: g.corpus[i]['name']
gismo.post_documents_cluster = post_documents_cluster_print
gismo.post_features_cluster = post_features_cluster_print

Machine Learning query

We perform the query Machine learning. The returned True tells us that some of the query features were found among the corpus' features.

Remark: For this tutorial, we just enter a few words, but at the start of this notebook, we talked about submitting an article. As a query can be as long as you want, you could call the rank method with the full text of your article.

[7]:
gismo.rank("Machine learning")
[7]:
True

What are the best ACM categories for an article on Machine Learning?

[8]:
gismo.get_documents_by_rank()
[8]:
['Machine learning',
 'Computing methodologies',
 'Machine learning algorithms',
 'Learning paradigms',
 'Machine learning theory',
 'Machine learning approaches',
 'Theory and algorithms for application domains',
 'Theory of computation',
 'Natural language processing',
 'Artificial intelligence',
 'Learning settings',
 'Supervised learning',
 'Reinforcement learning',
 'Education',
 'Dynamic programming for Markov decision processes',
 'Unsupervised learning']

Sounds nice. How are the top 10 domains related in the context of Machine Learning?

[9]:
gismo.get_documents_by_cluster(k=10)
 F: 0.06. R: 0.52. S: 0.75.
- F: 0.63. R: 0.48. S: 0.73.
-- F: 0.78. R: 0.41. S: 0.70.
--- F: 0.98. R: 0.16. S: 0.85.
---- Machine learning (R: 0.09; S: 0.84)
---- Computing methodologies (R: 0.06; S: 0.87)
--- Learning paradigms (R: 0.06; S: 0.62)
--- F: 0.94. R: 0.14. S: 0.63.
---- Machine learning theory (R: 0.06; S: 0.61)
---- Theory and algorithms for application domains (R: 0.05; S: 0.64)
---- Theory of computation (R: 0.04; S: 0.66)
--- Machine learning approaches (R: 0.05; S: 0.54)
-- Machine learning algorithms (R: 0.06; S: 0.60)
- F: 0.66. R: 0.04. S: 0.23.
-- Natural language processing (R: 0.03; S: 0.21)
-- Artificial intelligence (R: 0.02; S: 0.30)

OK! Let’s decode this:
- In the mainstream cluster, we have two main groups:
  - the practical fields (methodology, paradigms);
  - the theoretical fields.
- If you don’t want to decide, you can go with approaches/algorithms.
- But maybe your article uses machine learning to achieve NLP or AI?

Now, let’s look at the main keywords.

[10]:
gismo.get_features_by_rank()
[10]:
['learning',
 'reinforcement',
 'reinforcement learning',
 'decision',
 'supervised learning',
 'supervised',
 'machine',
 'iteration',
 'learning learning',
 'machine learning',
 'markov decision',
 'markov decision processes',
 'decision processes',
 'dynamic programming',
 'processes',
 'markov',
 'methods',
 'learning multi',
 'multi agent',
 'multi',
 'dynamic',
 'agent']

Let’s organize them.

[11]:
gismo.get_features_by_cluster()
 F: 0.82. R: 0.02. S: 0.95.
- F: 0.89. R: 0.02. S: 0.95.
-- learning (R: 0.00; S: 0.96)
-- reinforcement (R: 0.00; S: 0.83)
-- reinforcement learning (R: 0.00; S: 0.83)
-- decision (R: 0.00; S: 0.96)
-- supervised learning (R: 0.00; S: 0.81)
-- supervised (R: 0.00; S: 0.81)
-- machine (R: 0.00; S: 0.95)
-- iteration (R: 0.00; S: 0.68)
-- machine learning (R: 0.00; S: 0.93)
-- markov decision (R: 0.00; S: 0.89)
-- markov decision processes (R: 0.00; S: 0.89)
-- decision processes (R: 0.00; S: 0.89)
-- dynamic programming (R: 0.00; S: 0.70)
-- processes (R: 0.00; S: 0.89)
-- markov (R: 0.00; S: 0.89)
-- methods (R: 0.00; S: 0.86)
-- learning multi (R: 0.00; S: 0.82)
-- multi agent (R: 0.00; S: 0.82)
-- multi (R: 0.00; S: 0.85)
-- dynamic (R: 0.00; S: 0.71)
-- agent (R: 0.00; S: 0.82)
- learning learning (R: 0.00; S: 0.75)

Hmm, not very informative. Let’s increase the resolution to get more structure!

[12]:
gismo.get_features_by_cluster(resolution=.9)
 F: 0.73. R: 0.02. S: 0.95.
- F: 0.79. R: 0.02. S: 0.91.
-- F: 0.94. R: 0.01. S: 0.93.
--- F: 0.96. R: 0.01. S: 0.96.
---- F: 0.96. R: 0.00. S: 0.97.
----- learning (R: 0.00; S: 0.96)
----- decision (R: 0.00; S: 0.96)
---- F: 0.99. R: 0.00. S: 0.94.
----- machine (R: 0.00; S: 0.95)
----- machine learning (R: 0.00; S: 0.93)
--- F: 0.96. R: 0.01. S: 0.89.
---- F: 0.99. R: 0.00. S: 0.89.
----- F: 1.00. R: 0.00. S: 0.89.
------ markov decision (R: 0.00; S: 0.89)
------ markov decision processes (R: 0.00; S: 0.89)
------ decision processes (R: 0.00; S: 0.89)
------ markov (R: 0.00; S: 0.89)
----- processes (R: 0.00; S: 0.89)
---- methods (R: 0.00; S: 0.86)
-- F: 0.99. R: 0.00. S: 0.69.
--- iteration (R: 0.00; S: 0.68)
--- dynamic programming (R: 0.00; S: 0.70)
--- dynamic (R: 0.00; S: 0.71)
-- learning learning (R: 0.00; S: 0.75)
- F: 0.95. R: 0.01. S: 0.85.
-- F: 0.99. R: 0.00. S: 0.83.
--- F: 1.00. R: 0.00. S: 0.83.
---- reinforcement (R: 0.00; S: 0.83)
---- reinforcement learning (R: 0.00; S: 0.83)
--- multi (R: 0.00; S: 0.85)
-- F: 0.97. R: 0.00. S: 0.82.
--- F: 1.00. R: 0.00. S: 0.81.
---- supervised learning (R: 0.00; S: 0.81)
---- supervised (R: 0.00; S: 0.81)
--- learning multi (R: 0.00; S: 0.82)
-- F: 1.00. R: 0.00. S: 0.82.
--- multi agent (R: 0.00; S: 0.82)
--- agent (R: 0.00; S: 0.82)

Rough analysis:
- Machine learning is about… machine learning, which seems related to decision; Markov decision processes and dynamic programming seem to matter.
- Reinforcement learning and supervised learning seem to be categories of special interest, and multi-agent aspects seem to be involved.

P2P query

We perform the query P2P. The returned False tells us that P2P is not a feature of the corpus (it’s a small corpus after all, made only of category titles).

[13]:
gismo.rank("P2P")
[13]:
False

Let’s avoid the acronym and spell it out. OK, now it works.

[14]:
gismo.rank("Peer-to-peer")
[14]:
True
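Why does the hyphenated form work? CountVectorizer's default tokenizer splits Peer-to-peer into peer / to / peer, and since to is among our stop words, the surviving tokens even form the bigram peer peer. A standalone sketch on a made-up mini-corpus (not the actual ACM data, and with min_df dropped because the corpus is tiny):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up mini-corpus standing in for the ACM category queries
docs = [
    "peer to peer networks and protocols",
    "network protocols and architectures",
    "distributed architectures for peer to peer systems",
]
vectorizer = CountVectorizer(ngram_range=(1, 3), stop_words=['to', 'and', 'by'])
vectorizer.fit(docs)

vocab = vectorizer.vocabulary_
print('p2p' in vocab)        # False: the token never occurs in the corpus
print('peer' in vocab)       # True
print('peer peer' in vocab)  # True: 'to' is removed before n-grams are built
```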

What are the best ACM categories for an article on P2P?

[15]:
gismo.get_documents_by_rank()
[15]:
['Network protocols',
 'Distributed architectures',
 'Networks',
 'Network types',
 'Search engine architectures and scalability',
 'Software architectures',
 'Software system structures',
 'Architectures',
 'Computer systems organization',
 'Software organization and properties',
 'Information retrieval',
 'Software and its engineering',
 'Information systems']

Sounds nice. How are these domains related in the context of P2P?

[16]:
gismo.get_documents_by_cluster()
 F: 0.34. R: 0.59. S: 0.79.
- F: 0.90. R: 0.12. S: 0.60.
-- Network protocols (R: 0.06; S: 0.58)
-- Networks (R: 0.06; S: 0.71)
- F: 0.59. R: 0.47. S: 0.69.
-- F: 0.84. R: 0.31. S: 0.66.
--- F: 0.89. R: 0.15. S: 0.64.
---- Distributed architectures (R: 0.06; S: 0.62)
---- Architectures (R: 0.04; S: 0.64)
---- Computer systems organization (R: 0.04; S: 0.67)
--- F: 0.89. R: 0.16. S: 0.62.
---- Software architectures (R: 0.05; S: 0.56)
---- Software system structures (R: 0.05; S: 0.65)
---- Software organization and properties (R: 0.04; S: 0.68)
---- Software and its engineering (R: 0.03; S: 0.70)
-- Network types (R: 0.05; S: 0.52)
-- F: 0.80. R: 0.11. S: 0.52.
--- F: 0.95. R: 0.09. S: 0.50.
---- Search engine architectures and scalability (R: 0.05; S: 0.50)
---- Information retrieval (R: 0.04; S: 0.52)
--- Information systems (R: 0.02; S: 0.65)

OK! Let’s decode this. P2P relates to:
- Network protocols;
- Architectures, with two main groups:
  - the design fields (distributed architectures, organization);
  - the implementation fields (software).
- Inside architectures, but a little bit isolated, Search engine architectures and scalability plus Information retrieval / systems call for the scalability of P2P networks. In particular, a P2P expert will recognize Distributed Hash Tables, one of the main theoretical and practical successes of P2P.

Now, let’s look at the main keywords.

[17]:
gismo.get_features_by_rank(k=10)
[17]:
['peer',
 'protocols',
 'protocol',
 'peer peer',
 'architectures',
 'network',
 'link',
 'architectures tier',
 'architectures tier architectures',
 'tier architectures']

Let’s organize them.

[18]:
gismo.get_features_by_cluster(k=10)
 F: 0.63. R: 0.03. S: 0.92.
- F: 1.00. R: 0.01. S: 0.97.
-- peer (R: 0.01; S: 0.97)
-- peer peer (R: 0.00; S: 0.97)
- F: 0.84. R: 0.01. S: 0.57.
-- F: 1.00. R: 0.01. S: 0.49.
--- protocols (R: 0.00; S: 0.48)
--- protocol (R: 0.00; S: 0.49)
-- network (R: 0.00; S: 0.62)
-- link (R: 0.00; S: 0.61)
- F: 0.95. R: 0.01. S: 0.69.
-- architectures (R: 0.00; S: 0.79)
-- architectures tier (R: 0.00; S: 0.67)
-- architectures tier architectures (R: 0.00; S: 0.67)
-- tier architectures (R: 0.00; S: 0.67)

Rough analysis:
- one cluster about network protocols;
- one cluster about architectures.

PageRank query

We perform the query PageRank. The returned False tells us that PageRank is not a feature of the corpus (it’s a small corpus after all, made only of category titles).

[19]:
gismo.rank("Pagerank")
[19]:
False

Let’s try to avoid the copyright infringement. OK, now it works.

[20]:
gismo.rank("ranking the web")
[20]:
True
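The True/False returned by rank can be mimicked with a plain vocabulary test: a query hits if at least one of its n-grams is in the fitted vocabulary. A sketch on a toy corpus (not the ACM data):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus; a query "hits" when transform() finds at least one known n-gram
vectorizer = CountVectorizer(ngram_range=(1, 3))
vectorizer.fit(["web searching and ranking", "information retrieval models"])

print(vectorizer.transform(["pagerank"]).nnz > 0)         # False
print(vectorizer.transform(["ranking the web"]).nnz > 0)  # True
```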

What are the best ACM categories for an article on PageRank?

[21]:
gismo.get_documents_by_rank()
[21]:
['Web searching and information discovery',
 'World Wide Web',
 'Information systems',
 'Web applications',
 'Supervised learning',
 'Retrieval models and ranking',
 'Learning paradigms',
 'Information retrieval',
 'Machine learning',
 'Web mining',
 'Web services',
 'Web data description languages',
 'Computing methodologies',
 'Security and privacy',
 'Internet communications tools',
 'Networks',
 'Software and application security',
 'Network security',
 'Specialized information retrieval',
 'Network types',
 'Interaction paradigms',
 'Middleware for databases',
 'Network properties',
 'Human computer interaction (HCI)']

Sounds nice. How are these domains related in the context of PageRank?

[22]:
gismo.get_documents_by_cluster(k=10)
 F: 0.13. R: 0.43. S: 0.78.
- F: 0.45. R: 0.27. S: 0.68.
-- F: 0.81. R: 0.21. S: 0.70.
--- Web searching and information discovery (R: 0.08; S: 0.62)
--- F: 0.91. R: 0.13. S: 0.81.
---- World Wide Web (R: 0.08; S: 0.79)
---- Information systems (R: 0.05; S: 0.85)
-- Web applications (R: 0.05; S: 0.49)
-- Web mining (R: 0.02; S: 0.34)
- F: 0.27. R: 0.16. S: 0.48.
-- F: 0.92. R: 0.09. S: 0.42.
--- Supervised learning (R: 0.04; S: 0.41)
--- Learning paradigms (R: 0.03; S: 0.43)
--- Machine learning (R: 0.02; S: 0.43)
-- F: 0.81. R: 0.07. S: 0.37.
--- Retrieval models and ranking (R: 0.04; S: 0.34)
--- Information retrieval (R: 0.03; S: 0.45)

Hmm, maybe something more compact. Let’s lower the resolution (the default resolution is 0.7).

[23]:
gismo.get_documents_by_cluster(k=10, resolution=.6)
 F: 0.13. R: 0.43. S: 0.78.
- F: 0.45. R: 0.27. S: 0.68.
-- F: 0.81. R: 0.21. S: 0.70.
--- Web searching and information discovery (R: 0.08; S: 0.62)
--- World Wide Web (R: 0.08; S: 0.79)
--- Information systems (R: 0.05; S: 0.85)
-- Web applications (R: 0.05; S: 0.49)
-- Web mining (R: 0.02; S: 0.34)
- F: 0.27. R: 0.16. S: 0.48.
-- F: 0.92. R: 0.09. S: 0.42.
--- Supervised learning (R: 0.04; S: 0.41)
--- Learning paradigms (R: 0.03; S: 0.43)
--- Machine learning (R: 0.02; S: 0.43)
-- F: 0.81. R: 0.07. S: 0.37.
--- Retrieval models and ranking (R: 0.04; S: 0.34)
--- Information retrieval (R: 0.03; S: 0.45)

Better! Let’s broadly decode this:
- one cluster of categories is about the Web & search;
- one cluster is about how-to:
  - learning techniques;
  - information retrieval.

Now, let’s look at the main keywords.

[24]:
gismo.get_features_by_rank()
[24]:
['web',
 'ranking',
 'learning',
 'social',
 'supervised learning',
 'supervised',
 'discovery',
 'security',
 'site',
 'rank',
 'search',
 'learning rank']

Let’s organize them.

[25]:
gismo.get_features_by_cluster()
 F: 0.01. R: 0.02. S: 0.93.
- F: 0.13. R: 0.01. S: 0.93.
-- F: 0.87. R: 0.01. S: 0.86.
--- F: 0.87. R: 0.01. S: 0.85.
---- web (R: 0.00; S: 0.86)
---- ranking (R: 0.00; S: 0.91)
---- social (R: 0.00; S: 0.84)
---- discovery (R: 0.00; S: 0.80)
---- site (R: 0.00; S: 0.77)
--- search (R: 0.00; S: 0.83)
-- F: 0.90. R: 0.01. S: 0.47.
--- learning (R: 0.00; S: 0.44)
--- supervised learning (R: 0.00; S: 0.35)
--- supervised (R: 0.00; S: 0.35)
--- rank (R: 0.00; S: 0.51)
--- learning rank (R: 0.00; S: 0.50)
- security (R: 0.00; S: 0.14)

Rough analysis:
- one cluster about the Web;
- one cluster about learning;
- one lone wolf: security.