Asked By – KDecker
I am writing a Python MapReduce word count program. The problem is that there are many non-alphabet chars strewn about in the data. I have found the post "Stripping everything but alphanumeric chars from a string in Python", which shows a nice solution using regex, but I am not sure how to implement it:
def mapfn(k, v):
    print v
    import re, string
    pattern = re.compile('[\W_]+')
    v = pattern.match(v)
    print v
    for w in v.split():
        yield w, 1
I’m afraid I am not sure how to use the library re, or even regex for that matter. I am not sure how to apply the regex pattern to the incoming string (line of a book) v properly to retrieve the new line without any non-alphanumeric chars.
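Solution: Python, remove all non-alphabet chars from string
Compile a pattern that matches every character that is not a letter, then use its sub() method to replace those characters with the empty string: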
import re
regex = re.compile('[^a-zA-Z]')
# First parameter is the replacement, second parameter is your input string
regex.sub('', 'ab3d*E')
# Out: 'abdE'
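Applied to the mapper from the question, the same idea might look like the sketch below. It rests on a few assumptions: v is a single line of text, the framework only needs mapfn to yield (word, 1) pairs, and words should stay separate, so each run of non-letters is replaced with a space rather than the empty string. The import stays inside the function, as in the question, in case the framework ships the function to workers on its own. The attempt in the question fails because pattern.match() only tests for a match at the start of the string and returns a match object or None, not a cleaned string.

```python
def mapfn(k, v):
    # Imported inside the function, mirroring the question, in case the
    # MapReduce framework runs the function without module-level state.
    import re

    # Replace every run of non-letters with one space so that neighbouring
    # words are not glued together, then split on whitespace.
    cleaned = re.sub(r'[^a-zA-Z]+', ' ', v)
    for w in cleaned.split():
        yield w, 1
```

For example, list(mapfn(None, 'ab3d*E')) gives [('ab', 1), ('d', 1), ('E', 1)], because the digit and the asterisk act as word separators here. Use '' as the replacement instead if such characters should simply vanish from inside a word.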
Alternatively, if you only want to remove a certain set of characters (as an apostrophe might be okay in your input…)
regex = re.compile(r'[,\.!?]')  # etc.; raw string so the backslash reaches the regex engine unchanged
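A short usage sketch for this narrower pattern, assuming the four punctuation marks listed are the only characters to drop. The sample sentence is made up for illustration.

```python
import re

# Only these punctuation marks are removed; apostrophes, digits and
# spaces are left untouched. Inside a character class the dot needs no escape.
regex = re.compile(r'[,.!?]')

line = "Don't panic, it's only 42!"
cleaned = regex.sub('', line)
print(cleaned)  # Don't panic it's only 42
```

Because the apostrophes survive, words such as "Don't" and "it's" are counted as single words after split().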