This blog post discusses how to extract digits from a string, or from every string in a pandas DataFrame column, using Python's built-in regular expression module (import re). The post introduces a function called extract_digits, which takes a string as input and extracts all digits from it using a regular expression pattern. The pattern used in this function is \d+, which matches one or more consecutive digits (\d matches any digit, and + specifies that one or more occurrences should be matched).
The post provides a step-by-step explanation of how the function works. It applies the regular expression pattern to the input string using the re.findall() function, which finds all non-overlapping matches of the pattern in the string and returns them as a list of strings. It then joins the strings in the list into a single string using the .join() method, with an empty string as the separator.
Overall, the post shows how to use regular expressions to extract digits from a single string and from a DataFrame column in Python.
How to extract digits from a string (column) using regular expressions
# Data manipulation and analysis libraries
import numpy as np
import pandas as pd
# For regular expressions
import re
The function (extract_digits) uses regular expressions (imported with import re) to extract all digits from a given string s and return them as a single string.
- It first defines a regular expression pattern r'\d+', which matches one or more consecutive digits.
- \d matches any digit character: the numbers 0 to 9 and, for str patterns in Python 3, other Unicode decimal digits as well. + specifies that one or more occurrences of the preceding pattern should be matched. Here the preceding pattern is \d, so \d+ matches a run of one or more consecutive digits.
- It then applies this pattern to the input string s using the re.findall() function, which finds all non-overlapping matches of the pattern in the string and returns them as a list of strings.
- Finally, it joins all the strings in the list into a single string using the .join() method, which concatenates them together with an empty string as a separator.
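To make the two steps above concrete, here is a minimal sketch that runs them separately so the intermediate list is visible. The sample text is made up for illustration and is not one of the post's examples.

```python
import re

# Illustrative input (an assumption, not taken from the post)
text = "abc123def456"

# Step 1: find every non-overlapping run of consecutive digits
matches = re.findall(r'\d+', text)
print(matches)  # ['123', '456']

# Step 2: join the runs using an empty string as the separator
joined = ''.join(matches)
print(joined)   # 123456
```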
# Define a function to extract only digits from a string
def extract_digits(s):
    pattern = r'\d+'
    matches = re.findall(pattern, s)
    return ''.join(matches)
# Example string
s = "Hi, my name is francis and im 30 years old"
# Extract the digits
digits = extract_digits(s)
digits
'30'
# Example string
s = "[email protected]"
# Extract the digits
digits = extract_digits(s)
digits
'92'
# Example string
s = "I bought 1 pen and 2 notebooks"
# Extract the digits
digits = extract_digits(s)
digits
'12'
# Redefine the function to put a space between separate runs of digits
def extract_digits(s):
    pattern = r'\d+'
    matches = re.findall(pattern, s)
    return ' '.join(matches)
# Example string
s = "I bought 1 pen and 2 notebooks"
# Extract the digits
digits = extract_digits(s)
digits
'1 2'
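Instead of redefining the function each time a different separator is needed, one option is to pass the separator as an argument. This is only a sketch: the name extract_digits_sep and its sep parameter are hypothetical and are not part of the original post.

```python
import re

def extract_digits_sep(s, sep=''):
    """Hypothetical variant: join runs of digits with a caller-chosen separator."""
    matches = re.findall(r'\d+', s)
    return sep.join(matches)

# Default separator reproduces the first version of the function
no_space = extract_digits_sep("I bought 1 pen and 2 notebooks")         # '12'
# Passing a space reproduces the redefined version
with_space = extract_digits_sep("I bought 1 pen and 2 notebooks", ' ')  # '1 2'
```

The default of an empty string keeps the behaviour of the first version, so existing calls would not need to change.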
Extract digits from a column
# Create a sample DataFrame
df = pd.DataFrame({
    'Text': ['Im 12 years old', 'abc123def456ghi789', 'what is your name', 'I have 4 apples']
})
# Apply the extract_digits() function to the 'Text' column
df['Digits'] = df['Text'].apply(extract_digits)
df
| | Text | Digits |
|---|---|---|
| 0 | Im 12 years old | 12 |
| 1 | abc123def456ghi789 | 123 456 789 |
| 2 | what is your name | |
| 3 | I have 4 apples | 4 |
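One thing to watch for with real data: re.findall() raises a TypeError when it is given something that is not a string, which includes the NaN values pandas uses for missing text. Below is a minimal sketch of a tolerant version, assuming that an empty string is the result you want for such rows. The name extract_digits_safe is hypothetical.

```python
import re

def extract_digits_safe(value):
    """Hypothetical helper: like extract_digits, but returns '' for non-strings."""
    # Missing values (None, float NaN) and other non-strings have no digits to extract
    if not isinstance(value, str):
        return ''
    return ' '.join(re.findall(r'\d+', value))

print(extract_digits_safe('abc123def456ghi789'))  # 123 456 789
print(repr(extract_digits_safe(float('nan'))))    # ''
```

It can be applied to a column in the same way as before, for example df['Text'].apply(extract_digits_safe).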