Data Science Digest 8

Keeping up is hard for data scientists. Chisel Analytics is happy to help!

Title: Karate Club consists of state-of-the-art methods to do unsupervised learning on graph structured data

Source: https://github.com/benedekrozemberczki/karateclub and https://karateclub.readthedocs.io/en/latest/notes/introduction.html
How: Install from GitHub; the documentation covers data handling, the full list of implemented methods, and datasets.
When to use this: When you need to perform “small-scale graph mining research. First, it provides network embedding techniques at the node and graph level. Second, it includes a variety of overlapping and non-overlapping community detection methods.”
Why it’s helpful: Incorporates Overlapping Community Detection, Non-Overlapping Community Detection, Neighborhood-Based Node Level Embedding, Structural Node Level Embedding, Attributed Node Level Embedding, and Graph Level Embedding.
Suggested application: Use the clusterings and embeddings for downstream learning. Use case examples include measuring how well Facebook page clusters align with group memberships, studying abuse of the Twitch platform, and classifying threads on Reddit.
Business impact or insights to be gained: “Only quick and minimal changes to the code are needed when a model performs poorly.”
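
To make "non-overlapping community detection" concrete, here is a minimal plain-Python sketch of label propagation, one common technique in that family. It does not use Karate Club and is not the library's implementation; the toy graph, the function name, and the deterministic tie-break rule are all assumptions made for illustration.

```python
from collections import Counter


def label_propagation(adjacency, max_rounds=20):
    """Assign each node to exactly one community by repeatedly adopting
    the most common label among its neighbours."""
    labels = {node: node for node in adjacency}  # every node starts alone
    for _ in range(max_rounds):
        changed = False
        for node in sorted(adjacency):
            counts = Counter(labels[n] for n in adjacency[node])
            if not counts:
                continue  # an isolated node keeps its own label
            best = max(counts.values())
            candidates = [lab for lab, c in counts.items() if c == best]
            # Tie-break (an assumption of this sketch): keep the current
            # label if it is tied for best, otherwise take the largest.
            if labels[node] in candidates:
                new_label = labels[node]
            else:
                new_label = max(candidates)
            if new_label != labels[node]:
                labels[node] = new_label
                changed = True
        if not changed:
            break  # labels are stable
    return labels


# Hypothetical toy graph: two triangles joined by the single edge 2-3.
graph = {
    0: [1, 2], 1: [0, 2], 2: [0, 1, 3],
    3: [2, 4, 5], 4: [3, 5], 5: [3, 4],
}
memberships = label_propagation(graph)

# Group nodes by their final label to list the communities.
communities = {}
for node, label in memberships.items():
    communities.setdefault(label, []).append(node)
print(sorted(communities.values()))
```

The node-to-community mapping in `memberships` is the kind of output you would feed into downstream learning, for example as a categorical feature per node.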

Title: Discovering millions of datasets on the web

Source: https://blog.google/products/search/discovering-millions-datasets-web/
How: Go to https://datasetsearch.research.google.com/ to search for table, image, or text datasets (or submit your dataset for inclusion).
When to use this: When you are looking for industry, geographical, financial, or other data, along with links to the source.
Why it’s helpful: With almost 25 million indexed datasets, the data you need is likely available.
Suggested application: Building models for company or product/service expansion, providing context for interpreting in-house data against external sources, supporting scientific research, and searching US government data tables.
Business impact or insights to be gained: An easy way to search for and access real datasets, or to share your own for search and research purposes.
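
If you want to submit your own dataset, the sketch below builds a schema.org `Dataset` description as JSON-LD, the kind of metadata a dataset landing page can carry. It assumes Dataset Search discovers datasets through this markup, which is worth confirming against Google's current guidelines. The function name and every value (names, URLs, license) are hypothetical placeholders.

```python
import json


def dataset_jsonld(name, description, url, license_url, download_url,
                   encoding_format="text/csv"):
    """Build a schema.org Dataset description as a plain dictionary."""
    return {
        "@context": "https://schema.org/",
        "@type": "Dataset",
        "name": name,
        "description": description,
        "url": url,
        "license": license_url,
        "distribution": [
            {
                "@type": "DataDownload",
                "encodingFormat": encoding_format,
                "contentUrl": download_url,
            }
        ],
    }


# Placeholder values for illustration only.
metadata = dataset_jsonld(
    name="Example regional sales figures",
    description="Monthly sales totals by region (illustrative placeholder).",
    url="https://example.org/datasets/regional-sales",
    license_url="https://creativecommons.org/licenses/by/4.0/",
    download_url="https://example.org/datasets/regional-sales.csv",
)

# This JSON would typically sit inside a
# <script type="application/ld+json"> tag on the dataset's landing page.
jsonld_text = json.dumps(metadata, indent=2)
print(jsonld_text)
```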

Title: VizSeq: A visual analysis toolkit for accelerating text generation research (Facebook OpenSource)

Source: https://ai.facebook.com/blog/vizseq-a-visual-analysis-toolkit-for-accelerating-text-generation-research/
How: GitHub: https://github.com/facebookresearch/vizseq. VizSeq requires Python 3.6+ and currently runs on Unix/Linux and macOS/OS X. It will support Windows as well in the future.
Why it’s helpful: “A Python toolkit for visual analysis on text generation tasks like machine translation, summarization, image captioning, speech translation and video description. It takes multi-modal sources, text references as well as text predictions as inputs, and analyzes them visually in Jupyter Notebook or a built-in Web App (the former has Fairseq integration). VizSeq also provides a collection of multi-process scorers as a normal Python package.”
Suggested application: “Visualizes text generation outputs where you can filter, sort, and inspect examples with multimodal data, highlighted differences, and various metrics displayed all in one place.” It allows users to explore dataset characteristics and compare models holistically under various metrics.
Business impact or insights to be gained: “Performs speedy evaluation on large data sets (multiprocess accelerated) covering a wide collection of metrics: BLEU, NIST, METEOR, TER, RIBES, chrF, GLEU, ROUGE, CIDEr, WER, LASER and BERTScore. It also has a simple API to help define new metrics.” “Existing open source analysis tools often lack the functionality integration and optimization for productivity and scalability.”
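
To show what one of the listed metrics measures, here is a minimal plain-Python sketch of word error rate (WER): the word-level edit distance between a reference and a system output, divided by the reference length. It does not use VizSeq and is not its scorer; the function name and the example sentences are assumptions made for illustration.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical outputs from two systems for the same reference sentence.
reference = "the cat sat on the mat"
outputs = {
    "system_a": "the cat sat on mat",   # one word dropped
    "system_b": "a cat is on the mat",  # two words substituted
}
scores = {name: word_error_rate(reference, hyp) for name, hyp in outputs.items()}
for name, score in sorted(scores.items()):
    print(f"{name}: WER = {score:.3f}")
```

Lower is better, so under this single metric system_a comes out ahead. Looking at several metrics side by side gives a fuller picture than any one score.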

Be sure to subscribe to this blog for more tips. And sign up for our free tools to learn of job opportunities and more.
