Introducing pdlog

May 29, 2020 • 1 min read

Most data problems don't need the horsepower of Spark or Hadoop. If all of your data fits comfortably in memory, pandas may be a great fit. It's my go-to for quickly building out production pipelines that are both efficient and easy to maintain.

One issue I've run into is that pandas doesn't natively log what it's doing. Fortunately, pandas allows extending the DataFrame API with accessors. We've implemented an accessor for logging in the publicly available pdlog package. Importing pdlog registers a LogAccessor under the .log namespace, exposing a collection of methods that log what they're doing.

import pandas as pd
import pdlog

df = (
    pd.read_csv("myfile.csv")
    .sort_values("timestamp")
    .log.dropna(subset=["timestamp"])
    .log.drop_duplicates()
    .log.pivot("timestamp", "feature", "value")
)
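If you're curious how this works under the hood, pandas exposes a public extension API for registering accessors. Here's a minimal sketch of the mechanism, assuming a hypothetical accessor named log_sketch with a single wrapped method (this is illustrative only, not pdlog's actual implementation):

```python
import logging

import pandas as pd

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("log_sketch")


# Registering the class makes it available as df.log_sketch on every
# DataFrame. The name "log_sketch" is hypothetical, chosen to avoid
# clashing with pdlog's real .log namespace.
@pd.api.extensions.register_dataframe_accessor("log_sketch")
class LogSketchAccessor:
    def __init__(self, df: pd.DataFrame):
        self._df = df

    def dropna(self, *args, **kwargs) -> pd.DataFrame:
        # Wrap the underlying DataFrame method and log the row delta.
        before = len(self._df)
        result = self._df.dropna(*args, **kwargs)
        logger.info(
            "dropna: dropped %d rows (%d -> %d)",
            before - len(result), before, len(result),
        )
        return result


df = pd.DataFrame({"a": [1.0, None, 3.0]})
cleaned = df.log_sketch.dropna()
```

Because the accessor returns plain DataFrames, wrapped methods chain just like the built-in ones.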

What are you waiting for? Go on. pip install pdlog! And please don't hesitate to reach out to me if you find it helpful or if you have any questions or comments.