Most data problems don't need the horsepower of Spark or Hadoop. If all of your data fits comfortably in memory, pandas may be a great fit. Pandas is my go-to for quickly building production pipelines that are both efficient and easy to maintain.
One issue I've run into is that pandas doesn't natively log. Fortunately, pandas allows extending the DataFrame API with custom accessors. We have since implemented an accessor for logging in the publicly available pdlog package. Importing pdlog registers the LogAccessor under the .log namespace, with a collection of methods that log what they're doing.
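To illustrate the accessor mechanism itself, here is a minimal sketch of how a logging accessor can be registered with `pd.api.extensions.register_dataframe_accessor`. This is a simplified, hypothetical version wrapping a single method; the real pdlog LogAccessor covers many more methods.

```python
import logging

import pandas as pd

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("log_accessor_sketch")


# Hypothetical minimal accessor: wraps dropna and logs the row delta.
# The real pdlog package implements this idea for many DataFrame methods.
@pd.api.extensions.register_dataframe_accessor("log")
class LogAccessor:
    def __init__(self, df: pd.DataFrame):
        self._df = df

    def dropna(self, *args, **kwargs) -> pd.DataFrame:
        before = len(self._df)
        out = self._df.dropna(*args, **kwargs)
        logger.info(
            "dropna: dropped %d rows (%d -> %d)",
            before - len(out), before, len(out),
        )
        return out


df = pd.DataFrame({"x": [1.0, None, 3.0]})
clean = df.log.dropna()
```

Because the accessor is constructed from the DataFrame on each `.log` access, it composes naturally with method chaining, which is what makes the pipeline style below possible.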
```python
import pandas as pd
import pdlog

df = (
    pd.read_csv("myfile.csv")
    .sort_values("timestamp")
    .log.dropna(subset=["timestamp"])
    .log.drop_duplicates()
    .log.pivot("timestamp", "feature", "value")
)
```
What are you waiting for? Go on, `pip install pdlog`! And please don't hesitate to reach out to me if you find it helpful or if you have any questions or comments.