Data Science Digest 5

Title: The Easy Way to Do Advanced Data Visualisation for Data Scientists

Author: George Seif, AI/Machine Learning Engineer, KDnuggets
Source: kdnuggets.com/2019/08/advanced-data-visualisation-data-scientists.html
How: Plotly, a Python plotting library built on top of D3.js.
When to use this: If data visualization isn’t your primary area and yet you are tasked with providing data visualizations.
Why it’s helpful: Unlike Matplotlib, Plotly provides interactivity out of the box.
Suggested application: Fancy plots, scatter plots, box plots, heat maps.
Business impact or insights to be gained: Plots are simpler to build than with Matplotlib, and the built-in interactivity is likely to be well received by stakeholders who are not data specialists.

Title: Version Control for Data Science — Tracking Machine Learning models and datasets

Author: Vipul Jain, The Journal Blog
Source: https://blog.usejournal.com/version-control-for-data-science-tracking-your-machine-learning-models-and-datasets-aaa61f20bb45
How: DVC (https://dvc.org/); detailed installation instructions are in a linked blog post. Works on top of Git. Storage agnostic: supports GCS, S3, Azure, and more.
When to use this: When you want to control and monitor different versions of large data files, such as datasets and trained model files, including the ability to roll back and/or switch among versions.
Why it’s helpful: DVC improves productivity by letting you return to an earlier state of your data processing and model building without keeping a manual log.
Suggested application: Tracking machine learning models, datasets, label encodings, and similar artifacts.
Business impact or insights to be gained: DVC saves time and money and creates efficient workflows. It can reuse and reproduce files quickly while managing versions, running simulations, and testing programs.
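
The idea behind versioning large files alongside Git can be illustrated with a toy sketch: store each version of a file in a cache keyed by its content hash, and keep only a small pointer that is cheap to commit. This is not DVC's implementation or API. The function names snapshot and checkout, the pointer layout, and the choice of SHA-256 are all assumptions made for this illustration.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def snapshot(data_file: Path, cache_dir: Path) -> dict:
    """Copy the file into a hash-addressed cache and return a small pointer."""
    digest = hashlib.sha256(data_file.read_bytes()).hexdigest()
    cache_dir.mkdir(parents=True, exist_ok=True)
    cached = cache_dir / digest
    if not cached.exists():  # identical content is stored only once
        shutil.copy2(data_file, cached)
    return {"path": data_file.name, "hash": digest}


def checkout(pointer: dict, cache_dir: Path, dest: Path) -> None:
    """Restore the version of the file that the pointer refers to."""
    shutil.copy2(cache_dir / pointer["hash"], dest)


def demo() -> str:
    """Create two versions of a dataset, then roll back to the first."""
    with tempfile.TemporaryDirectory() as tmp:
        work = Path(tmp)
        data, cache = work / "train.csv", work / "cache"
        data.write_text("id,label\n1,cat\n")
        v1 = snapshot(data, cache)
        data.write_text("id,label\n1,cat\n2,dog\n")
        snapshot(data, cache)  # second version now lives in the cache too
        checkout(v1, cache, data)  # roll back to the first version
        return data.read_text()
```

In this sketch the pointer dictionaries play the role of the small files you would commit to Git, while the cache directory stands in for local or remote storage.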

Title: Stop Using Mean to Fill Missing Data

Author:  Dario Radecic, Towards Data Science
Source: https://towardsdatascience.com/stop-using-mean-to-fill-missing-data-678c0d396e22
How: Multivariate Imputation by Chained Equations (MICE), impyute library through p.
When to use this: When you need to fill missing data.
Why it’s helpful: MICE imputes the missing values multiple times, producing several “complete” datasets. It handles different data types, such as continuous and binary variables, and generally yields more accurate datasets than mean imputation.
Suggested application: Predictive models built on incomplete datasets, where MICE tends to give higher accuracy than filling with mean values.
Business impact or insights to be gained: Improves the accuracy of datasets, so stakeholders receive data that better describes real-world conditions, which in turn supports better outcomes.
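
As a rough sketch of the difference between the two approaches, the snippet below fills the same small array once with column means and once with a chained-equations style imputer. It uses scikit-learn's IterativeImputer, which the scikit-learn documentation describes as inspired by MICE, rather than the impyute library named above; that substitution and the sample values are assumptions made for this example.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (must precede the import below)
from sklearn.impute import IterativeImputer, SimpleImputer

# Hypothetical data: each column is roughly a multiple of the first.
X = np.array([
    [1.0, 2.0, 3.0],
    [2.0, np.nan, 6.0],
    [3.0, 6.0, np.nan],
    [4.0, 8.0, 12.0],
    [np.nan, 10.0, 15.0],
])

# Baseline: replace each missing value with its column mean.
mean_filled = SimpleImputer(strategy="mean").fit_transform(X)

# Chained equations: model each column from the others, in rounds.
mice_filled = IterativeImputer(max_iter=10, random_state=0).fit_transform(X)

print(mean_filled)
print(mice_filled)
```

By default IterativeImputer returns a single completed dataset. To get several, as MICE does, one option is to run it repeatedly with sample_posterior=True and different random_state values.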

Don’t forget to subscribe for more tips and tricks to stay on top of Data Science developments!