Most data problems don't need the horsepower of Spark or Hadoop. If all of your data fits comfortably in memory, pandas may be a great fit. Pandas is my go-to for quickly building production pipelines that are both efficient and easy to maintain.
One issue I've run into is that pandas doesn't natively log what it's doing. Fortunately, pandas allows extending the DataFrame API with accessors.
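
To give a feel for the mechanism, here's a minimal sketch of how accessor registration works in pandas; the rowcount accessor and its report method are made up for this illustration and aren't part of pdlog.

import pandas as pd

@pd.api.extensions.register_dataframe_accessor("rowcount")
class RowCountAccessor:
    """Toy accessor: every DataFrame gains a .rowcount namespace."""

    def __init__(self, df: pd.DataFrame):
        self._df = df

    def report(self) -> pd.DataFrame:
        # Print a row count, then return the frame so calls can be chained.
        print(f"{len(self._df)} rows")
        return self._df

df = pd.DataFrame({"a": [1, 2, 3]})
df.rowcount.report()  # prints "3 rows" and returns df

Because each method returns the DataFrame, accessor calls slot straight into an ordinary method chain.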
We have since implemented an accessor for logging in the publicly available pdlog package. Importing pdlog registers a LogAccessor under the .log namespace, with a collection of methods that log what they're doing.
import pandas as pd
import pdlog  # importing pdlog registers the .log accessor on DataFrames

df = (
    pd.read_csv("myfile.csv")
    .sort_values("timestamp")          # plain pandas call: not logged
    .log.dropna(subset=["timestamp"])  # logged
    .log.drop_duplicates()             # logged
    .log.pivot(index="timestamp", columns="feature", values="value")  # logged
)
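
Assuming pdlog emits its messages through Python's standard logging module, you'll also need a handler configured before anything shows up; a minimal setup might look like this:

import logging

# Configure a basic handler at INFO level so log records are printed.
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(name)s: %(message)s",
)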
What are you waiting for? Go on: pip install pdlog! And please don't hesitate to reach out to me if you find it helpful or if you have any questions or comments.