Pandas Snippets that I found delectable

Pandas Snippets that I found delectable

Picture this:

I The age of data is upon us. You have your database tables set up and ready to go. Except for one. A fellow engineer returns from his dig, looking for the new oil. Exhausted, he dumps tens of thousands of rows in your bucket; "Here's what I could scrounge up. There are a couple of missing columns, though. So keep that in mind as you ingest it". A sliver of unease slips into your mind at that last part; "Thanks" you say, as you rev up a VM to begin the process...

Pandas is a python library used for data manipulation and analysis. In my efforts to get data into a database, I've found the library much better at preparing that data for ingestion, rather than vanilla python, let's say. Here are bits and slices of code I've found helpful in that regard.

II The VM lights green, ready to be ssh'd into. You consider pulling up your editor and using it as your means of connecting to the VM, which pulls you into the thought of connecting just to the notebook in particular that you need. The web says that it's possible, and so you keep that tidbit for later. Everything is ready, and you're looking at the columns in the dataframe. What are you going to do about the missing ones...

df["new_column"] = pd.Series(
    np.random.randint(low=10, high=90, size=len(df) - 1),
    index=[x for x in range(len(df) - 1)]
)

III There are two other missing columns though. You see that their values can be retrieved from another set of rows you have. They exist in a list you created for this very purpose...

"""
Create list of lat/lng points from a polygons geojson
"""
gdf = gpd.GeoDataFrame.from_features(polygons_geojson)
gdf["latlng"] = gdf["geometry"].centroid
gdf["lat"] = [x for x in gdf["latlng"].x]
gdf["lng"] = [y for y in gdf["latlng"].y]
gdf.drop("latlng", axis=1, inplace=True)

centroid_existing_fields = json.loads(gdf.to_json())
"""
Attach the missing columns
CAUTION: There's an On^2 here so tread lightly
"""
for col in ["lat", "lng"]:
    df[col] = df.apply(
        lambda x: next(filter(
            lambda y: y["properties"]["field_id"] == x["field_id"],
            centroid_existing_fields["features"]
        ))["properties"][col], axis=1
    )

IV The extra columns have been catered to for the time being. Everything is in order. In order. Are the columns arranged right? Are they of the same type as those in the database? No, there's a datetime column in the DB. And no, a few of the columns are not in their right positions.

df["month"] = year + "-" + df['month'] + "-01"
df = df.rename(columns={"month": "date_column"})
df["date_column"] = pd.to_datetime(df["date_column"])

"""
Rearrange the columns
"""
df = df["col1, date_column, col3, col4".split(",")]

V You try a copy command of the rows into the particular table, but there are rows with null valued cells. They were not in the head. Or tail. So you need to find them, log them and then drop them. We're working on a deadline you think, and need to have the front-end ready at least, as the remaining pieces are hunted down by your fellow engineer.

rows_without_empty_cells = df[~(df.isna().any(axis=1))]
rows_with_empty_cells = df[df.isna().any(axis=1)]

As of this article's writing, pandas (pd) v1.3.0 and numpy (np) v1.21 are the current releases.

Photo by Joshua J. Cotten on Unsplash