History¶
X.X.X (TODO-List)¶
- Rethink distortion, both for vector normalization and for the IDTF/query trade-off.
- Accelerate similarity computation (currently sklearn-based) in clustering.
0.4.X (2023-0X-XX) (tentative)¶
- Context manager for FileSource (e.g. with FileSource(...) as source:)
- Python 3.9 compatibility issues rechecked
- Wheels
- Minor change in test_dblp.py
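For readers unfamiliar with the pattern, the sketch below shows what context-manager support generally involves in Python. It is not Gismo's FileSource: the class name LineSource and its behavior are hypothetical, chosen only to illustrate how __enter__ and __exit__ guarantee that a file handle is released when the with block ends.

```python
import os
import tempfile


class LineSource:
    """Hypothetical file-backed source (NOT Gismo's FileSource)."""

    def __init__(self, path):
        self.path = path
        self.handle = None

    def __enter__(self):
        # Open the underlying file when entering the `with` block.
        self.handle = open(self.path, encoding="utf-8")
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        # Always release the handle, even if the block raised.
        self.close()
        return False  # do not swallow exceptions

    def close(self):
        if self.handle is not None:
            self.handle.close()
            self.handle = None

    def __iter__(self):
        # Yield the lines of the file without their trailing newline.
        self.handle.seek(0)
        for line in self.handle:
            yield line.rstrip("\n")


def demo():
    """Write a tiny file, read it through LineSource, report handle state."""
    with tempfile.TemporaryDirectory() as folder:
        path = os.path.join(folder, "toy.txt")
        with open(path, "w", encoding="utf-8") as f:
            f.write("first\nsecond\n")
        with LineSource(path) as source:
            lines = list(source)
            opened_inside = source.handle is not None
        closed_after = source.handle is None
    return lines, opened_inside, closed_after
```

The point of the design is that callers no longer need to remember an explicit close call: leaving the with block, normally or through an exception, closes the file.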
0.4.3 (2022-12-26)¶
- Refresh dependencies, compatibilities, and such.
- Gismo is tested up to Python 3.10.
- Patch for sklearn API changes (ngram_range must be a tuple; get_feature_names has been renamed get_feature_names_out).
- Updated MixInIO logic: you now save with the dump method and load with the load class method.
- Package management now uses GitHub Actions.
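As an illustration of the sklearn changes mentioned in this release, here is a minimal sketch of how calling code can tolerate both method names and normalize ngram_range. It is not the patch applied in Gismo: the helpers feature_names and as_ngram_range, and the two stand-in classes, are hypothetical names introduced for this example so that it runs without scikit-learn installed.

```python
def feature_names(vectorizer):
    """Return the vocabulary as a list, whichever method name is available."""
    getter = getattr(vectorizer, "get_feature_names_out", None)
    if getter is None:
        # Legacy name, used before the rename to get_feature_names_out.
        getter = vectorizer.get_feature_names
    return list(getter())


def as_ngram_range(value):
    """Coerce a 2-item sequence such as [1, 2] into the tuple (1, 2)."""
    low, high = value
    return (int(low), int(high))


class OldVectorizer:
    """Stand-in exposing only the legacy method name (assumption for the demo)."""

    def get_feature_names(self):
        return ["gismo", "rank"]


class NewVectorizer:
    """Stand-in exposing only the new method name (assumption for the demo)."""

    def get_feature_names_out(self):
        return ("gismo", "rank")
```

With a real fitted vectorizer, feature_names(vectorizer) would be called the same way; the stand-ins only exist to exercise both branches.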
0.4.2 (2021-05-05)¶
Minor patch
- Signature of the Gismo rank method changed to accept a query vector directly instead of a string query (useful if you want to craft a custom query vector).
- Original source of the Reuters 50/50 dataset was discontinued; changed to an alternate source.
- Fix for a change in the spaCy API.
0.4.1 (2020-11-25)¶
Minor update.
- DBLP API modified so you can specify the set of fields you want to retrieve.
- Minor update in doctests.
- Python 3.9 compatibility added.
0.4.0 (2020-07-21)¶
0.4 is a big update. Lots of things added, lots of things changed.
- New API for Gismo runtime parameters (see new parameters module for details). Short version:
- gismo = Gismo(corpus, embedding, alpha=0.85): creates a gismo with damping factor set to 0.85 instead of the default value.
- gismo.parameters.alpha = 0.85: sets the damping factor of the gismo to 0.85.
- gismo.rank(query, alpha=0.85): makes a query with damping factor temporarily set to 0.85.
- Landmarks! Half Corpus, half Gismo, the Landmarks class can simplify many analysis tasks.
- Landmarks are (small) corpora where each entry is augmented with the computation of an associated gismo query;
- Landmarks can be used to refine the analysis around a part of your data;
- They can be used as soft and fast classifiers.
- Landmarks’ runtime parameters follow the same approach as Gismo instances (see above).
- See the dedicated tutorial to learn more!
- Documentation summer cleaning.
- query_distortion parameter (reshape subspace for clustering) is renamed distortion and is now a float instead of a bool (i.e. you can apply distortion in a non-binary way).
- Full refactoring of get_*** and post_*** methods and objects.
- The good news is that they are now more natural, self-describing, and unified.
- The bad news is that there is no backward-compatibility with previous Gismo versions. Hopefully this refactoring will last for some time!
- Gismo logo added!
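The three ways of setting a runtime parameter described for this release (at construction, on the instance, and per call) can be pictured with the following minimal sketch. It is not Gismo's parameters module: the classes Parameters and Engine, the default value of alpha, and the fact that rank simply returns the effective alpha are all assumptions made for illustration.

```python
from dataclasses import dataclass, replace


@dataclass
class Parameters:
    """Hypothetical container of runtime parameters (not Gismo's class)."""

    alpha: float = 0.5  # arbitrary default for this sketch, not Gismo's default

    def resolve(self, **overrides):
        # Return a copy with per-call overrides applied; self is left untouched.
        return replace(self, **overrides)


class Engine:
    """Hypothetical engine holding persistent parameters."""

    def __init__(self, **kwargs):
        # Construction-time values replace the defaults.
        self.parameters = Parameters(**kwargs)

    def rank(self, query, **overrides):
        # Per-call values apply to this call only.
        effective = self.parameters.resolve(**overrides)
        return effective.alpha  # a real engine would compute a ranking here


engine = Engine(alpha=0.85)          # set at construction
engine.parameters.alpha = 0.7        # changed persistently on the instance
temporary = engine.rank("some query", alpha=0.9)  # overridden for one call
persistent = engine.rank("some query")            # falls back to the instance value
```

Resolving overrides on a copy is what keeps a per-call value from leaking into later calls.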
0.3.1 (2020-06-12)¶
- New dataset: Reuters C50
- New module: sentencizer
0.3.0 (2020-05-13)¶
- dblp module: url2source function added to directly load a small dblp source in memory instead of using a FileSource approach.
- Possibility to disable query distortion in gismo.
- XGismo class to cross analyze embeddings.
- Tutorials updated
0.2.5 (2020-05-11)¶
- auto_k feature: if k is not specified, a reasonable, query-dependent number of results is estimated.
- covering methods added to gismo. It is now possible to use get_covering_* instead of get_ranked_* to maximize coverage and/or eliminate redundancy.
0.2.4 (2020-05-07)¶
- Tutorials for ACM and DBLP added. After cleaning, there are currently 3 tutorials:
- Toy model, to get the hang of Gismo on a tiny example,
- ACM, to play with Gismo on a small example,
- DBLP, to play with a large dataset.
0.2.3 (2020-05-04)¶
- ACM and DBLP dataset creation added.
0.2.2 (2020-05-04)¶
- Notebook tutorials added (early version)
0.2.1 (2020-05-03)¶
- Actual code
- Coverage badge
0.1.0 (2020-04-30)¶
- First release on PyPI.