Asked By – pocketfullofcheese
I have two DataFrames which I want to merge based on a column. However, due to alternate spellings, different number of spaces, absence/presence of diacritical marks, I would like to be able to merge as long as they are similar to one another.
Any similarity algorithm will do (soundex, Levenshtein, difflib’s).
Say one DataFrame has the following data:
    from pandas import DataFrame

    df1 = DataFrame([[1],[2],[3],[4],[5]], index=['one','two','three','four','five'], columns=['number'])

           number
    one         1
    two         2
    three       3
    four        4
    five        5

    df2 = DataFrame([['a'],['b'],['c'],['d'],['e']], index=['one','too','three','fours','five'], columns=['letter'])

          letter
    one        a
    too        b
    three      c
    fours      d
    five       e
Then I want to get the resulting DataFrame
           number letter
    one         1      a
    two         2      b
    three       3      c
    four        4      d
    five        5      e
Now we will see the solution to the issue: is it possible to do a fuzzy match merge with python pandas?
Similar to @locojay's suggestion, you can apply difflib's get_close_matches to df2's index and then apply a join:
    In [1]: import difflib

    In [2]: difflib.get_close_matches
    Out[2]: <function difflib.get_close_matches>

    # get_close_matches returns a list of candidates, so take the best one with [0]
    In [3]: df2.index = df2.index.map(lambda x: difflib.get_close_matches(x, df1.index)[0])

    In [4]: df2
    Out[4]:
          letter
    one        a
    two        b
    three      c
    four       d
    five       e

    In [5]: df1.join(df2)
    Out[5]:
           number letter
    one         1      a
    two         2      b
    three       3      c
    four        4      d
    five        5      e
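It may help to see what difflib.get_close_matches returns on its own. It gives back a list of candidates ordered from best to worst, and an empty list when nothing scores at or above the cutoff (0.6 by default). The short sketch below uses only the standard library; the candidate list mirrors df1's index, and the word 'zzz' is a made-up example of a label with no counterpart.

```python
import difflib

# Candidate labels, mirroring df1's index from the question.
candidates = ["one", "two", "three", "four", "five"]

# A misspelled label: the result is a list, best match first.
close = difflib.get_close_matches("too", candidates)
print(close)  # ['two']

# A label with nothing similar (made-up example): the list is empty,
# so indexing it with [0] would raise IndexError.
none_found = difflib.get_close_matches("zzz", candidates)
print(none_found)  # []

# Raising the cutoff makes matching stricter; 'too' vs 'two' no longer qualifies.
strict = difflib.get_close_matches("too", candidates, cutoff=0.8)
print(strict)  # []
```

So taking the first element is only safe when every label is known to have a close counterpart.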
If these were columns, in the same vein you could apply it to the column and then merge:
    df1 = DataFrame([[1,'one'],[2,'two'],[3,'three'],[4,'four'],[5,'five']], columns=['number', 'name'])
    df2 = DataFrame([['a','one'],['b','too'],['c','three'],['d','fours'],['e','five']], columns=['letter', 'name'])

    # Replace each name in df2 with its closest match in df1 (first element of the returned list)
    df2['name'] = df2['name'].apply(lambda x: difflib.get_close_matches(x, df1['name'])[0])
    df1.merge(df2)
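If some names might have no close counterpart at all, one possible variation is to map them to None and drop them before merging. This is only a sketch under assumptions: the helper closest_match is a hypothetical name, not part of pandas or difflib, and the sample rows (including the unmatched 'zzz') are invented for illustration.

```python
import difflib

import pandas as pd


def closest_match(name, candidates, cutoff=0.6):
    """Hypothetical helper: best fuzzy match for name, or None if nothing qualifies."""
    matches = difflib.get_close_matches(name, candidates, n=1, cutoff=cutoff)
    return matches[0] if matches else None


# Made-up sample data; 'zzz' has no counterpart in df1.
df1 = pd.DataFrame({"number": [1, 2, 3], "name": ["one", "two", "three"]})
df2 = pd.DataFrame({"letter": ["a", "b", "z"], "name": ["one", "too", "zzz"]})

# Rewrite df2's names to the spelling used in df1 (None where no match exists).
candidates = df1["name"].tolist()
df2["name"] = df2["name"].apply(lambda x: closest_match(x, candidates))

# Drop unmatched rows, then keep every df1 row with a left merge.
merged = df1.merge(df2.dropna(subset=["name"]), on="name", how="left")
print(merged)
```

Rows of df1 without a partner end up with a missing letter instead of raising an error. Note that two different names in df2 could still map to the same name in df1, which would produce duplicate rows in the merge.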
This question is answered By – Andy Hayden
This answer is collected from stackoverflow and reviewed by FixPython community admins, is licensed under cc by-sa 2.5 , cc by-sa 3.0 and cc by-sa 4.0