Introducing pdlog
Most data problems don’t need the horsepower of Spark or Hadoop. If all of your data fits comfortably in memory, pandas may be a great fit. pandas is my go-to for quickly building out production pipelines that are both efficient and easy to maintain.
One issue I’ve run into is that pandas doesn’t natively log. Fortunately, it allows extending the dataframe API with accessors. We have since implemented an accessor for logging in the publicly available pdlog package.
To get started:
1. Install pdlog: pip install pdlog
2. Import pdlog in your application: import pdlog
3. Add .log before your method calls: df = df.log.dropna()
They’ll now log useful information about the operation, for example:
2020-05-26 20:55:30,049 INFO <pdlog> dropna: dropped 1 row (17%), 5 rows remaining
It works by registering a custom LogAccessor under the .log namespace on import. The accessor contains a collection of wrapper methods that log what they’re doing.
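To make the mechanism concrete, here is a minimal sketch of how such an accessor could be registered with pandas’ public extension API, pandas.api.extensions.register_dataframe_accessor. It is not pdlog’s actual source: the DemoLogAccessor class, the demo_log namespace, the logger name, and the message format are all assumptions chosen for illustration, and the namespace is deliberately different from .log so it doesn’t clash with pdlog.

```python
import logging

import pandas as pd

# Hypothetical logger name, used only for this illustration.
logger = logging.getLogger("demo_log_accessor")


# Registering under "demo_log" (an assumed name) makes the class
# available as df.demo_log on every DataFrame.
@pd.api.extensions.register_dataframe_accessor("demo_log")
class DemoLogAccessor:
    def __init__(self, pandas_obj):
        # pandas passes in the DataFrame the accessor was reached from.
        self._df = pandas_obj

    def dropna(self, **kwargs):
        # Wrapper around DataFrame.dropna that logs the row counts.
        # This sketch assumes row-wise dropping and does not handle
        # inplace=True, where dropna returns None.
        before = len(self._df)
        result = self._df.dropna(**kwargs)
        dropped = before - len(result)
        logger.info(
            "dropna: dropped %d rows, %d rows remaining", dropped, len(result)
        )
        return result
```

With a logging handler configured at INFO level (for example via logging.basicConfig(level=logging.INFO)), calling df.demo_log.dropna() returns the same result as df.dropna() and emits one log record describing how many rows were removed.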