During my first year as a data scientist, I watched myself and others retype the same lines of code and retrace our work time and time again. Perhaps some of this did not warrant concern.
After all, how long does it take to type the standard imports,
```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
```
and the like?
Yet there were also plenty of real concerns, as my colleagues and I performed many of the same tasks over and over: filling null values, standardizing column names, and creating dummy variables. Shouldn't we be able to standardize these rote processes instead of recoding the entire preprocessing pipeline every time?
Even worse, sometimes a day's worth of exploratory analysis would surface fruitful insights, only for you to realize that the Jupyter notebook you'd been working in was a jumbled mess: you'd jumped around repeatedly, fixing errors and rerunning cells out of order. How on earth are you supposed to repeat that process now?
It's also funny to me that despite proclaiming the immense value of object-oriented programming, none of my instructors pointed out how to practically incorporate that philosophy into a daily workflow.
I hope this article helps you sidestep the pitfalls many of us have fallen into and develop a more productive, sensible workflow.
The key to a more productive workflow is building and organizing reusable code. Whether you prefer a functional or object-oriented approach, you don't want to find yourself retracing the same steps time and time again. Undoubtedly you've seen the benefit of writing your own functions and classes. The next step is formalizing this practice and creating an organized workflow.
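To make that concrete, here is a minimal sketch of the kind of reusable preprocessing helpers I have in mind, covering the rote tasks mentioned earlier. The function names and defaults are my own invention, not from any particular project:

```python
import pandas as pd

def standardize_columns(df):
    """Standardize column names: strip whitespace, lowercase, snake_case."""
    df = df.copy()
    df.columns = df.columns.str.strip().str.lower().str.replace(" ", "_")
    return df

def basic_preprocess(df, fill_value=0, dummy_cols=None):
    """Standardize columns, fill null values, and create dummy variables."""
    df = standardize_columns(df)
    df = df.fillna(fill_value)
    if dummy_cols is not None:
        df = pd.get_dummies(df, columns=dummy_cols)
    return df
```

Written once and imported everywhere, helpers like these replace the copy-pasted preprocessing blocks that otherwise accumulate across notebooks.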
I regularly use Jupyter Notebooks in my workflow. They are an indispensable tool for exploration and analysis. I also find them very helpful in the development process, allowing me to interactively test and modify code.
That said, don't forget the value of a good old .py file. It is much easier to call a .py file from the command line, and all of your favorite packages, from pandas to seaborn, are just collections of .py files.
Similarly, you can write your own functions and classes in .py files and then import them into a Jupyter notebook. For example, you could write a bunch of web scraping functions in a file named scrapingtools.py and import them into a notebook with `import scrapingtools as st`.
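For illustration, here is a sketch of what such a file might contain. Only the scrapingtools name comes from the example above; the functions, and the choice of requests and BeautifulSoup, are assumptions for demonstration:

```python
# scrapingtools.py -- hypothetical web scraping helpers
import requests
from bs4 import BeautifulSoup

def get_soup(url):
    """Fetch a page and parse it into a BeautifulSoup object."""
    response = requests.get(url)
    response.raise_for_status()
    return BeautifulSoup(response.text, "html.parser")

def extract_links(url):
    """Return all hyperlinks found on the page at `url`."""
    soup = get_soup(url)
    return [a["href"] for a in soup.find_all("a", href=True)]

if __name__ == "__main__":
    # Because this is a plain .py file, it can also be run directly
    # from the command line: python scrapingtools.py
    print(extract_links("https://example.com"))
```

In the notebook, `st.get_soup(url)` and `st.extract_links(url)` are then available like any other library functions, and the `__main__` guard means the same file doubles as a command-line script.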
While this may be old news, my experience has shown it is vastly underused.
If the .py file you wrote is not in the current working directory of your Python session, you'll see an error message when trying to import it:
```python
import scrapingtools as st
# ModuleNotFoundError: No module named 'scrapingtools'
```
Not to worry: you just need to tell Python where to look for your newly written module. The Python path tells the interpreter where to search for packages and modules when you call an import statement. You can inspect your current path using the built-in sys module.
```python
import sys
sys.path
```
You can also temporarily modify this list to include additional directories where your relevant .py files are stored:
```python
sys.path.append("directory/path")
```
To verify that the directory was added correctly, you can reinspect the path:
```python
sys.path
```
The last key to really perfecting this workflow is the `autoreload` magic command.
You may have noticed an earlier mention of a magic command, `%matplotlib inline`. Magic commands are, simply put, magical: they extend the functionality of Jupyter notebooks in genuinely useful ways. If you haven't explored them yet, a handful of them can transform your notebook experience, and the IPython documentation covers the full set of built-in magic commands.
The `autoreload` extension reloads your modules automatically before executing code in a cell. This means you can update a custom script you are working on in a separate .py file and use those changes live in your Jupyter notebook, without restarting the kernel or manually reimporting the file.
Imagine that you are scraping the web with Selenium. You first open up a new Jupyter notebook, import selenium, and start a new browser session. Right after that, create a new blank .py file alongside it.
Let's say you're scraping Airbnb, so you title the file airbnb.py. From there, you can start prototyping a login function in the .py file and test whether it works by executing it from your Jupyter notebook. This allows a fluid workflow in which you develop clean, modular code in one file while keeping the flexibility and interactivity Jupyter provides.
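As a sketch of what that first prototype might look like (the URL path, form selectors, and login flow below are entirely hypothetical, not Airbnb's actual markup):

```python
# airbnb.py -- hypothetical first draft of a login helper
from selenium.webdriver.common.by import By

def login(driver, email, password):
    """Log in through a hypothetical email/password form."""
    driver.get("https://www.airbnb.com/login")
    driver.find_element(By.NAME, "email").send_keys(email)
    driver.find_element(By.NAME, "password").send_keys(password)
    driver.find_element(By.CSS_SELECTOR, "button[type='submit']").click()
```

From the notebook you would call `airbnb.login(driver, email, password)` against the browser session you already opened, tweak the function in the .py file when it fails, and try again. For the notebook to pick up those edits live, you need autoreload.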
First, load the extension:
```python
%load_ext autoreload
```
Then, run `%autoreload`.
I recommend using `%autoreload 1` and reloading only specified modules, such as the development file that encapsulates your organized (and reusable) code. `%autoreload 1` reloads only the modules you explicitly mark with `%aimport`, while `%autoreload 2` reloads all imported modules before every cell execution.
By using `%autoreload 1`, you save some minor overhead by not reloading base packages such as pandas, numpy, and matplotlib, which are not apt to change during your programming session.
Returning to our example, here's how you could import your custom Python file and ensure it is reloaded automatically every time you execute code in the notebook:
```python
%load_ext autoreload
%autoreload 1
%aimport airbnb
```
To demonstrate, add a line to your airbnb.py file: `test = 5`. Calling `airbnb.test` from your Jupyter notebook should return 5. Change the value in the Python file to `test = 7` and save it. Rerunning `airbnb.test` in Jupyter now returns 7, reflecting the change you made. With this, you can reap the benefits of Jupyter while maintaining organized, reusable tools, seamlessly jumping between the two.
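As notebook cells, that round trip looks something like this:

```python
airbnb.test   # returns 5 while airbnb.py contains: test = 5

# Edit airbnb.py so that test = 7 and save the file, then rerun:
airbnb.test   # now returns 7 -- no kernel restart, no manual reimport
```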
This workflow also pairs well with git, allowing you to create meaningful, navigable commits as you gradually update and improve the modular code stored in the .py file. As you add new classes, methods, and functions, these changes are reflected in your commit log as well.
Finally, while I think .py files are more modular and better suited to this work than .ipynb files, you can even import other Jupyter notebooks, and the functions or classes within them, into a separate notebook!
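One way to do this is the third-party import-ipynb package; this is my suggestion rather than something the workflow above requires, and the notebook name below is hypothetical:

```python
# pip install import-ipynb
import import_ipynb   # registers an import hook for .ipynb files

import my_helpers     # imports my_helpers.ipynb, executing its cells

# Functions and classes defined in that notebook are now attributes:
help(my_helpers)
```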
Organizing your workflow and separating reusable code from exploratory analysis is foundational, allowing you to develop a repertoire of self-designed tools. With the autoreload magic command, Jupyter can even serve as a development environment where you interactively test that code while it lives in a separate file, clean, clear, and modular. Together, these habits give you the best of both worlds and a saner, more streamlined workflow.