Not an avid Pandas or Numpy user myself, but I had to spend some time lately to optimise probably a fairly common process: looking up a nested dictionary (2 or more levels) to find the right values element-wise for a column in a Pandas DataFrame. If it’s not clear, the problem I’m trying to solve here is to optimise a look-up function that can be done with .apply()
to something more performant.
You might say, why not using .map()
? Because the look-up function is not y = f(x)
, no, it is more like y = f(x, a)
or even y = f(x, a, b)
, depending on the level of nestedness.
As mentioned earlier, this can be implemented with .apply()
by supplying a Python function that does the look-up. However, .apply()
is very slow (it’s not vectorised). The solution here is actually straightforward (I’m very new to Pandas and it took me some time to get here so I decided to make a note here for this). We first flatten the nested dictionary to have different levels of keys as tuples, which allows us to create a DataFrame
with MultiIndex
. With MultiIndex
, we can easily apply .merge
to join the DataFrame
objects.
Hopefully the code snippet is more understandable.
import pandas as pd
nested_dict = {
"A": {
"Apple": "Red",
"Banana": "Green",
},
"B": {"Apple": "Green", "Banana": "Yellow"},
}
df = pd.DataFrame.from_dict(
{
"Fruit": {0: "Apple", 1: "Banana", 2: "Banana"},
"Price": {0: 0.911, 1: 1.734, 2: 1.844},
"Bucket": {0: "A", 1: "B", 2: "A"},
}
)
# Method 1: .apply()
# Apply Python function element-wise, as slow as a regular for loop
df1 = df.copy()
df1["Color"] = df1.apply(
lambda row: nested_dict.get(row["Bucket"], {}).get(row["Fruit"]), axis=1
)
print(df1)
# Method 2: .merge()
# Vectorized, much faster (even though the big O is the same)
df2 = df.copy()
# The only overhead is to construct a dataframe with 'MultiIndex'
nested_df = pd.DataFrame.from_dict(
{
(inner_key, outer_key): value
for outer_key, outer_value in nested_dict.items()
for inner_key, value in outer_value.items()
},
orient="index",
)
nested_df.index = pd.MultiIndex.from_tuples(nested_df.index)
nested_df.rename(columns={0: "Color"}, inplace=True)
df2 = df2.merge(nested_df, how="left", left_on=("Fruit", "Bucket"), right_index=True)
print(df2)