Introducing pdlog
Most data problems don’t need the horsepower of Spark or Hadoop. If all of your data fits comfortably in memory, pandas may be a great fit. pandas is my go-to for quickly building out production pipelines that are both efficient and easy to maintain.
One issue I’ve run into is that pandas doesn’t log natively. Fortunately, it allows extending the dataframe API with accessors. We have since implemented an accessor for logging in the publicly available pdlog package.
To get started:
1. Install pdlog: pip install pdlog
2. Import pdlog in your application: import pdlog
3. Add .log before your method calls: df = df.log.dropna()

They’ll now log useful information about the operation, for example:
2020-05-26 20:55:30,049 INFO <pdlog> dropna: dropped 1 row (17%), 5 rows remaining
It works by registering a custom LogAccessor under the .log namespace on import. The accessor contains a collection of wrapper methods that log what they’re doing.
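To make the accessor mechanism concrete, here is a minimal sketch of the general pattern using pandas’ public register_dataframe_accessor decorator. It is not pdlog’s actual implementation: the sketchlog namespace, the SketchLogAccessor class, the logger name, and the message format are all hypothetical names chosen for illustration, and only dropna is wrapped.

```python
import logging

import pandas as pd

# Hypothetical logger name, chosen for this sketch only.
logger = logging.getLogger("sketchlog")


# Registering under "sketchlog" (an assumed name) avoids clashing with
# pdlog's own .log namespace if both are imported together.
@pd.api.extensions.register_dataframe_accessor("sketchlog")
class SketchLogAccessor:
    """Illustrative accessor; not pdlog's actual implementation."""

    def __init__(self, pandas_obj: pd.DataFrame) -> None:
        # pandas passes in the dataframe the accessor was reached from.
        self._df = pandas_obj

    def dropna(self, *args, **kwargs) -> pd.DataFrame:
        # Assumes the default inplace=False, so dropna returns a new frame.
        before = len(self._df)
        result = self._df.dropna(*args, **kwargs)
        dropped = before - len(result)
        # Guard against division by zero on an empty dataframe.
        pct = 100 * dropped / before if before else 0.0
        logger.info(
            "dropna: dropped %d row(s) (%.0f%%), %d row(s) remaining",
            dropped,
            pct,
            len(result),
        )
        return result


if __name__ == "__main__":
    logging.basicConfig(
        level=logging.INFO,
        format="%(asctime)s %(levelname)s <%(name)s> %(message)s",
    )
    df = pd.DataFrame({"x": [1.0, None, 3.0, 4.0, 5.0, 6.0]})
    df = df.sketchlog.dropna()  # logs how many rows were dropped
```

Because the wrapper returns the result of the underlying pandas method, calls through the accessor can be chained or reassigned just like the plain method calls they replace.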