Question
Asked By – Jan Tojnar
How can I remove all characters except numbers from string?
Now we will see solution for issue: Remove characters except digits from string using Python?
Answer
In Python 2.*, by far the fastest approach is the .translate
method:
>>> x='aaa12333bb445bb54b5b52'
>>> import string
>>> all=string.maketrans('','')
>>> nodigs=all.translate(all, string.digits)
>>> x.translate(all, nodigs)
'1233344554552'
>>>
string.maketrans
makes a translation table (a string of length 256) which in this case is the same as ''.join(chr(x) for x in range(256))
(just faster to make;-). .translate
applies the translation table (which here is irrelevant since all
essentially means identity) AND deletes characters present in the second argument — the key part.
.translate
works very differently on Unicode strings (and strings in Python 3 — I do wish questions specified which major-release of Python is of interest!) — not quite this simple, not quite this fast, though still quite usable.
Back to 2.*, the performance difference is impressive…:
$ python -mtimeit -s'import string; all=string.maketrans("", ""); nodig=all.translate(all, string.digits); x="aaa12333bb445bb54b5b52"' 'x.translate(all, nodig)'
1000000 loops, best of 3: 1.04 usec per loop
$ python -mtimeit -s'import re; x="aaa12333bb445bb54b5b52"' 're.sub(r"\D", "", x)'
100000 loops, best of 3: 7.9 usec per loop
Speeding things up by 7-8 times is hardly peanuts, so the translate
method is well worth knowing and using. The other popular non-RE approach…:
$ python -mtimeit -s'x="aaa12333bb445bb54b5b52"' '"".join(i for i in x if i.isdigit())'
100000 loops, best of 3: 11.5 usec per loop
is 50% slower than RE, so the .translate
approach beats it by over an order of magnitude.
In Python 3, or for Unicode, you need to pass .translate
a mapping (with ordinals, not characters directly, as keys) that returns None
for what you want to delete. Here’s a convenient way to express this for deletion of “everything but” a few characters:
import string
class Del:
def __init__(self, keep=string.digits):
self.comp = dict((ord(c),c) for c in keep)
def __getitem__(self, k):
return self.comp.get(k)
DD = Del()
x='aaa12333bb445bb54b5b52'
x.translate(DD)
also emits '1233344554552'
. However, putting this in xx.py we have…:
$ python3.1 -mtimeit -s'import re; x="aaa12333bb445bb54b5b52"' 're.sub(r"\D", "", x)'
100000 loops, best of 3: 8.43 usec per loop
$ python3.1 -mtimeit -s'import xx; x="aaa12333bb445bb54b5b52"' 'x.translate(xx.DD)'
10000 loops, best of 3: 24.3 usec per loop
…which shows the performance advantage disappears, for this kind of “deletion” tasks, and becomes a performance decrease.
This question is answered By – Alex Martelli
This answer is collected from stackoverflow and reviewed by FixPython community admins, is licensed under cc by-sa 2.5 , cc by-sa 3.0 and cc by-sa 4.0