Fix Python – Pandas: convert categories to numbers

Question

Asked By – sachinruk

Suppose I have a dataframe with countries that goes as:

cc | temp
US | 37.0
CA | 12.0
US | 35.0
AU | 20.0

I know that there is a pd.get_dummies function to convert the countries to ‘one-hot encodings’. However, I wish to convert them to indices instead such that I will get cc_index = [1,2,1,3] instead.

I’m assuming that there is a faster way than using the get_dummies along with a numpy where clause as shown below:

[np.where(x) for x in df.cc.get_dummies().values]

This is somewhat easier to do in R using ‘factors’ so I’m hoping pandas has something similar.

Now we will see solution for issue: Pandas: convert categories to numbers


Answer

First, change the type of the column:

df.cc = pd.Categorical(df.cc)

Now the data look similar but are stored categorically. To capture the category codes:

df['code'] = df.cc.cat.codes

Now you have:

   cc  temp  code
0  US  37.0     2
1  CA  12.0     1
2  US  35.0     2
3  AU  20.0     0

If you don’t want to modify your DataFrame but simply get the codes:

df.cc.astype('category').cat.codes

Or use the categorical column as an index:

df2 = pd.DataFrame(df.temp)
df2.index = pd.CategoricalIndex(df.cc)

This question is answered By – John Zwinck

This answer is collected from stackoverflow and reviewed by FixPython community admins, is licensed under cc by-sa 2.5 , cc by-sa 3.0 and cc by-sa 4.0