Fix Python – Python, remove all non-alphabet chars from string

I am writing a Python MapReduce word count program. The problem is that many non-alphabet chars are strewn about in the data. I found the post "Stripping everything but alphanumeric chars from a string in Python", which shows a nice solution using regex, but I am not sure how to implement it:
def mapfn(k, v):
    print v
    import re, string ….
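One way to approach the question above, as a minimal sketch: strip everything that is not an ASCII letter or whitespace before splitting into words. The helper name `clean_words` and the sample sentence are hypothetical, and the sketch assumes only ASCII letters count as "alphabet" characters.

```python
import re

# Matches runs of anything that is not an ASCII letter or whitespace.
# Assumption: only A-Z and a-z count as alphabet characters.
NON_ALPHA = re.compile(r"[^A-Za-z\s]+")

def clean_words(line):
    """Drop non-alphabet characters, then split into lowercase words."""
    return NON_ALPHA.sub("", line).lower().split()

words = clean_words("Hello, world! It's 2 o'clock...")
print(words)  # ['hello', 'world', 'its', 'oclock']
```

Inside a map function, a helper like this could be called on each input value before the per-word counts are emitted.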

Fix Python – Handling backreferences to capturing groups in re.sub replacement pattern

I want to take the string 0.71331, 52.25378 and return 0.71331,52.25378 – i.e. just look for a digit, a comma, a space and a digit, and strip out the space.
This is my current code:
coords = '0.71331, 52.25378'
coord_re = re.sub("(\d), (\d)", "\1,\2", coords)
print coord_re

But this gives me 0.7133,2.25378. What am I doing wrong?
….
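A likely cause of the result above is that `"\1"` and `"\2"` in an ordinary string literal are octal escapes for control characters, so the backreferences never reach the regex engine. A minimal sketch using raw strings (the variable name `fixed` is chosen here for illustration):

```python
import re

coords = "0.71331, 52.25378"

# Raw strings keep \d, \1 and \2 intact for the regex engine
# instead of letting Python interpret them as string escapes.
fixed = re.sub(r"(\d), (\d)", r"\1,\2", coords)
print(fixed)  # 0.71331,52.25378
```

The same effect can be had by doubling the backslashes (`"\\1,\\2"`), but raw strings are usually easier to read.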

Fix Python – Do regular expressions from the re module support word boundaries (\b)?

While trying to learn a little more about regular expressions, I found a tutorial suggesting that \b matches a word boundary. However, the following snippet in the Python interpreter does not work as expected:
>>> x = 'one two three'
>>> y = re.search("\btwo\b", x)

It should have been a match object if anything was matched, but it is None….
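The `re` module does support `\b`; the snippet above most likely fails because `"\b"` in an ordinary string literal is the backspace character. A short sketch comparing the two spellings (the names `plain` and `raw` are illustrative):

```python
import re

x = "one two three"

# In a normal string literal, "\b" is a backspace (0x08), so the
# pattern looks for backspace + "two" + backspace and finds nothing.
plain = re.search("\btwo\b", x)

# In a raw string, \b reaches the regex engine as a word boundary.
raw = re.search(r"\btwo\b", x)

print(plain)        # None
print(raw.group())  # two
```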

Fix Python – Return string with first match for a regex, handling case where there is no match

I want to get the first match of a regex.
In this case, I have a list:
text = 'aa33bbb44'
re.findall('\d+',text)

['33', '44']

I could extract the first element of the list:
text = 'aa33bbb44'
re.findall('\d+',text)[0]

'33'

But that only works if there is at least one match; otherwise I’ll get an error:
text = 'aazzzbbb'
re.findall('\d+',text….
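One possible way to handle the no-match case is `re.search`, which returns `None` instead of raising. The sketch below wraps it in a hypothetical helper, `first_match`, with an assumed default of an empty string; the question excerpt does not say which fallback value is wanted.

```python
import re

def first_match(pattern, text, default=""):
    """Return the first match of pattern in text, or default if none."""
    m = re.search(pattern, text)
    return m.group(0) if m else default

found = first_match(r"\d+", "aa33bbb44")
missing = first_match(r"\d+", "aazzzbbb")
print(repr(found))    # '33'
print(repr(missing))  # ''
```

`re.search` stops at the first match, so it also avoids building the full list that `re.findall` would produce.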

Fix Python – Split a string at uppercase letters

What is the pythonic way to split a string before the occurrences of a given set of characters?
For example, I want to split
'TheLongAndWindingRoad'
at any occurrence of an uppercase letter (possibly except the first), and obtain
['The', 'Long', 'And', 'Winding', 'Road'].
Edit: It should also split single occurrences, i.e.
from 'ABC' I’d like to….
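A sketch of one approach: rather than splitting, collect each uppercase letter together with the non-uppercase characters that follow it. The helper name `split_at_uppercase` is hypothetical, the pattern only treats ASCII A-Z as uppercase, and the behaviour shown for 'ABC' (one piece per letter) is an assumption, since the excerpt is cut off before stating the desired result.

```python
import re

# An uppercase letter plus any following non-uppercase characters,
# or a leading run of non-uppercase characters.
PIECE = re.compile(r"[A-Z][^A-Z]*|[^A-Z]+")

def split_at_uppercase(s):
    """Split s before each ASCII uppercase letter."""
    return PIECE.findall(s)

print(split_at_uppercase("TheLongAndWindingRoad"))
# ['The', 'Long', 'And', 'Winding', 'Road']
print(split_at_uppercase("ABC"))
# ['A', 'B', 'C']
```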

Fix Python – How to use regex to find all overlapping matches

I’m trying to find every 10-digit series of numbers within a larger series of numbers using re in Python 2.6.
I can easily grab the non-overlapping matches, but I want every match in the number series. E.g.,
in "123456789123456789"
I should get the following list:
[1234567891,2345678912,3456789123,4567891234,5678912345,6789123456,7891234567,891234….
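Overlapping matches can be collected with a capturing group inside a lookahead: the lookahead consumes no characters, so the scan advances one position at a time. A minimal sketch (the variable names are illustrative, and the matches are returned as strings rather than the integers shown in the question):

```python
import re

digits = "123456789123456789"

# The lookahead (?=...) matches at every position without consuming
# input; the inner group captures the 10 digits starting there.
matches = re.findall(r"(?=(\d{10}))", digits)

print(len(matches))  # 9
print(matches[0])    # 1234567891
print(matches[-1])   # 9123456789
```

Wrapping each item in `int()` would give numbers instead of strings, if that is what is needed.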

Fix Python – Regular expression matching a multiline block of text

I’m having a bit of trouble getting a Python regex to work when matching against text that spans multiple lines. The example text is ('\n' is a newline)
some Varying TEXT\n
\n
DSJFKDAFJKDAFJDSAKFJADSFLKDLAFKDSAF\n
[more of the above, ending with a newline]\n
[yep, there is a variable number of lines here]\n
\n
(repeat the above a few hundred times….
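The excerpt is cut off before it says what should be extracted, so the following is only a sketch under an assumption: that the goal is to capture each heading line together with the block of uppercase lines that follows the blank line. The sample text and all names are made up for illustration.

```python
import re

# Hypothetical sample in the shape described above.
text = (
    "some Varying TEXT\n"
    "\n"
    "DSJFKDAFJKDAFJDSAKFJADSFLKDLAFKDSAF\n"
    "AAAABBBB\n"
    "\n"
    "other Varying TEXT\n"
    "\n"
    "CCCCDDDD\n"
)

# re.MULTILINE makes ^ match at the start of every line.
# Group 1: the heading line. Group 2: one or more uppercase-only lines.
block_re = re.compile(r"^(.+)\n\n((?:[A-Z]+\n)+)", re.MULTILINE)

blocks = [(m.group(1), m.group(2).split()) for m in block_re.finditer(text)]
for heading, lines in blocks:
    print(heading, lines)
```

Note that `.` does not match a newline unless `re.DOTALL` is set, which is what keeps group 1 on a single line here.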

Fix Python – Strip / trim all strings of a dataframe

While cleaning the values of a multitype data frame in Python/pandas, I want to trim the strings. I am currently doing it in two instructions:
import pandas as pd

df = pd.DataFrame([[' a ', 10], [' c ', 5]])

df.replace('^\s+', '', regex=True, inplace=True) #front
df.replace('\s+$', '', regex=True, inplace=True) #end

df.values

This is quite slow….
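As one alternative to the two regex replacements, each cell can be stripped with plain `str.strip`, leaving non-string values untouched. This is a sketch, not a benchmarked solution: the helper name `strip_strings` is hypothetical, and whether it is faster than the `replace` calls would need measuring on real data.

```python
import pandas as pd

df = pd.DataFrame([[" a ", 10], [" c ", 5]])

def strip_strings(frame):
    """Return a copy with leading/trailing whitespace removed from string cells."""
    strip = lambda value: value.strip() if isinstance(value, str) else value
    # Apply the cell-level function column by column; numbers pass through.
    return frame.apply(lambda column: column.map(strip))

cleaned = strip_strings(df)
print(cleaned.values.tolist())  # [['a', 10], ['c', 5]]
```

For columns known to hold only strings, `Series.str.strip()` is another option worth trying.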