Introducing pdlog

Seamless logging for pandas operations.
Author

Wasim Lorgat

Published

May 29, 2020

Most data problems don’t need the horsepower of Spark or Hadoop. If all of your data fits comfortably in memory, pandas may be a great fit. pandas is my go-to for quickly building out production pipelines that are both efficient and easy to maintain.

One issue I’ve run into is that pandas doesn’t natively log what its operations do. Fortunately, it allows extending the dataframe API with custom accessors. We’ve implemented a logging accessor in the publicly available pdlog package.

To get started:

  1. Install pdlog:

    pip install pdlog
  2. Import pdlog in your application:

    import pdlog
  3. Add .log before your method calls:

    df = df.log.dropna()

    Those calls will now log useful information about the operation, for example:

    2020-05-26 20:55:30,049 INFO <pdlog> dropna: dropped 1 row (17%), 5 rows remaining

It works by registering a custom LogAccessor under the .log namespace on import. The accessor contains a collection of wrapper methods that log what they’re doing.
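
To make the accessor idea concrete, here is a minimal sketch of how such a wrapper could be built with pandas’ register_dataframe_accessor. It is not pdlog’s actual source: the .sketchlog namespace, the SketchLogAccessor class and the "sketchlog" logger name are all hypothetical, chosen so the example doesn’t clash with pdlog itself, and only dropna is wrapped.

```python
import logging

import pandas as pd

# Hypothetical logger name, used only for this sketch.
logger = logging.getLogger("sketchlog")


# Hypothetical namespace; pdlog itself registers under ".log".
@pd.api.extensions.register_dataframe_accessor("sketchlog")
class SketchLogAccessor:
    """Illustrative accessor, not pdlog's real implementation."""

    def __init__(self, df: pd.DataFrame) -> None:
        # pandas passes in the dataframe the accessor was reached from.
        self._df = df

    def dropna(self, *args, **kwargs) -> pd.DataFrame:
        # Delegate to the real method, then log how the row count changed.
        # Assumes inplace=True is not used, so a dataframe is returned.
        before = len(self._df)
        result = self._df.dropna(*args, **kwargs)
        dropped = before - len(result)
        pct = 100 * dropped / before if before else 0.0
        logger.info(
            "dropna: dropped %d %s (%.0f%%), %d rows remaining",
            dropped,
            "row" if dropped == 1 else "rows",
            pct,
            len(result),
        )
        return result


# Example usage: send INFO messages to the console and call the wrapper.
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s <%(name)s> %(message)s",
)
df = pd.DataFrame({"a": [1.0, None, 3.0]})
df = df.sketchlog.dropna()
```

The decorator attaches the class to every dataframe under the given name, so the wrapper sits alongside the regular API rather than replacing it: df.dropna() stays silent, while df.sketchlog.dropna() returns the same result and logs a summary.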