Data Science Digest 6

Title: The 5 Most Useful Techniques to Handle Imbalanced Datasets

Author: Rahul Agarwal, Senior Statistical Analyst at Walmart Labs
Source: https://www.kdnuggets.com/2020/01/5-most-useful-techniques-handle-imbalanced-datasets.html
How: Resampling with imbalanced-learn (imblearn), including Tomek Links and SMOTE (Synthetic Minority Oversampling Technique); scikit-learn (sklearn).
When to use this: When a dataset is imbalanced, that is, when “you have such a small sample for the positive class in your dataset that the model is unable to learn”.
Why it’s helpful: It walks through ways to address the problem of an imbalanced dataset: random undersampling and oversampling, undersampling and oversampling with imbalanced-learn, class weights in the models, and changing the evaluation metric.
Suggested application: Finance, marketing/ad serving, transportation/airlines, medicine, content moderation, etc.
Business impact or insights to be gained: Models trained on imbalanced datasets “fail to capture the minority class, which is most often the point of creating the model in the first place.” As a result, an analysis might miss fraudulent bank transactions, a patient with a rare disease, a structural fault in an aircraft, etc.
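
The article itself works with imbalanced-learn (Tomek Links, SMOTE) and scikit-learn. As a rough illustration of three of the ideas listed above (random resampling, class weights, and changing the evaluation metric), here is a minimal sketch that uses only the Python standard library. The function names and the 95/5 toy dataset are hypothetical and are not taken from the article; the weight formula is the n_samples / (n_classes * class_count) heuristic that scikit-learn documents for class_weight="balanced".

```python
import random
from collections import Counter


def _group_by_class(rows, labels):
    """Collect rows into a dict keyed by class label."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(label, []).append(row)
    return groups


def random_undersample(rows, labels, seed=0):
    """Randomly drop rows until every class is as small as the smallest class."""
    rng = random.Random(seed)
    groups = _group_by_class(rows, labels)
    target = min(len(members) for members in groups.values())
    out_rows, out_labels = [], []
    for label, members in groups.items():
        for row in rng.sample(members, target):
            out_rows.append(row)
            out_labels.append(label)
    return out_rows, out_labels


def random_oversample(rows, labels, seed=0):
    """Randomly duplicate rows (with replacement) until every class matches the largest."""
    rng = random.Random(seed)
    groups = _group_by_class(rows, labels)
    target = max(len(members) for members in groups.values())
    out_rows, out_labels = [], []
    for label, members in groups.items():
        extra = [rng.choice(members) for _ in range(target - len(members))]
        for row in members + extra:
            out_rows.append(row)
            out_labels.append(label)
    return out_rows, out_labels


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    return {label: len(labels) / (len(counts) * n) for label, n in counts.items()}


def f1_binary(y_true, y_pred, positive=1):
    """F1 score for the positive class; 0.0 when there are no true positives."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical toy data: 95 negatives and 5 positives.
rows = [[i] for i in range(100)]
labels = [0] * 95 + [1] * 5

_, under_labels = random_undersample(rows, labels)
_, over_labels = random_oversample(rows, labels)
weights = balanced_class_weights(labels)

# A "model" that always predicts the majority class looks good on accuracy only.
always_negative = [0] * len(labels)
accuracy = sum(t == p for t, p in zip(labels, always_negative)) / len(labels)
f1 = f1_binary(labels, always_negative)

print(Counter(under_labels), Counter(over_labels))
print(weights, accuracy, f1)
```

On this toy data the always-negative predictor reaches 0.95 accuracy with an F1 of 0.0, which is the motivation for switching metrics. For real work, the library samplers the article describes are the better starting point than these hand-rolled helpers.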

Title: Visualize interaction effects in regression models

Author: Rick Wicklin
Source: https://blogs.sas.com/content/iml/2019/05/30/visualize-interaction-effects-regression.html
How: Use the EFFECTPLOT statement in SAS
When to use this: When both regressors are continuous, when one regressor is categorical and the other is continuous, or when both regressors are categorical.
Why it’s helpful: “If you use the EFFECTPLOT statement inside a regression procedure, you can overlay the model on the observed responses. In PROC PLM, the EFFECTPLOT statement visualizes only the model.” Also, you can automate many of these processes and slices.
Suggested application: Visualizing the interaction between regressors in a regression model; changing the slicing levels for continuous variables.
Business impact or insights to be gained: Business intelligence (BI) teams regularly use regression models to make predictions based on independent variables. Visualizing the interactions among multiple regressors can reveal insights that would otherwise be lost when models are examined independently.
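
The source does this with the SAS EFFECTPLOT statement. The Python sketch below is not SAS and not the author's code; it only illustrates what "slicing" a continuous-by-continuous interaction means. It assumes a model of the form y = b0 + b1*x1 + b2*x2 + b3*x1*x2 with made-up coefficients, fixes x2 at a few slice levels, and computes the predicted line across x1 for each slice.

```python
# Hypothetical fitted coefficients for y = b0 + b1*x1 + b2*x2 + b3*x1*x2.
B0, B1, B2, B3 = 10.0, 2.0, -1.0, 0.5


def predict(x1, x2):
    """Predicted response, including the x1*x2 interaction term."""
    return B0 + B1 * x1 + B2 * x2 + B3 * x1 * x2


def slice_lines(x1_grid, x2_slices):
    """For each fixed x2 level, return the predictions across x1_grid."""
    return {x2: [predict(x1, x2) for x1 in x1_grid] for x2 in x2_slices}


x1_grid = [0.0, 1.0, 2.0, 3.0, 4.0]
x2_slices = [0.0, 2.0, 4.0]  # illustrative slice levels
lines = slice_lines(x1_grid, x2_slices)

# With an interaction, the slope in x1 changes from slice to slice (b1 + b3*x2).
slopes = {
    x2: (ys[-1] - ys[0]) / (x1_grid[-1] - x1_grid[0])
    for x2, ys in lines.items()
}

for x2 in x2_slices:
    print(f"x2={x2}: slope in x1={slopes[x2]}, predictions={lines[x2]}")
```

With these coefficients the slope in x1 is 2.0, 3.0, and 4.0 at the three slice levels. Non-parallel lines like these are what an interaction plot makes visible; without the interaction term the lines would be parallel.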

Title: Supercharge your research: a ten-week plan for open data science

Author: Julia S. Stewart Lowndes, Halley E. Froehlich, Allison Horst, Nishad Jayasundara, Malin L. Pinsky, Adrian C. Stier, Nina O. Therkildsen & Chelsea L. Wood
Source: https://www.nature.com/articles/d41586-019-03335-4
How: Follow an actionable ten-week plan to create workflows that facilitate reproducibility and data sharing and that streamline code organization and collaboration, all centered on an ‘open’ ethos.
When to use this: When you need “a sustainable approach to establish more responsible data practices in research groups.”
Why it’s helpful: Researchers may lack formal training in data and open science, leaving data scientists to reinvent the wheel and work in isolation.
Suggested application: Universities, Associations, Research Departments, Cross-Functional Business Units.
Business impact or insights to be gained: A proven process for building alignment and supporting the shift in skills and culture toward a unified data program.

Don’t forget to subscribe for more tips and tricks to stay on top of Data Science developments!
