Searching a list of strings for a given pattern with regular expressions

Table of Contents

What I Want to Do?
Solution
- Why re.search?
- Comparison with the loop method
References

What I Want to Do?

word_list = ['Apple', 'Banana', 'Apple-juice', 'ple-ple', 'pleasure', 'Please']
pattern = 'ple'

func(pattern, word_list)
>>> ['Apple', 'Apple-juice', 'ple-ple', 'pleasure']

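From word_list, a list of strings, I want to return a list of the elements that contain the string ple.
pattern may also be given as a regular expression.

The task here is to write a function that meets these requirements.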

Solution

import re
from itertools import compress

def pygrep(pattern: str, word_list: list):
    list_idx = map(lambda x: bool(re.search(pattern, x)), word_list)
    res = list(compress(word_list, list_idx))
    return res

word_list = ['Apple', 'Banana', 'Apple-juice', 'ple-ple', 'pleasure', 'Please']
pattern = 'ple'

pygrep(pattern, word_list)
>>> ['Apple', 'Apple-juice', 'ple-ple', 'pleasure']
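
As a variation on the function above, the same filtering could also be sketched with a precompiled pattern and a list comprehension. The name `pygrep_compiled` is hypothetical and does not appear in the original post; this is only one possible way to write it.

```python
import re

def pygrep_compiled(pattern: str, word_list: list) -> list:
    # Compile once so the same pattern object is reused for every word.
    regex = re.compile(pattern)
    # Keep each word in which the pattern occurs anywhere.
    return [word for word in word_list if regex.search(word)]

word_list = ['Apple', 'Banana', 'Apple-juice', 'ple-ple', 'pleasure', 'Please']

print(pygrep_compiled('ple', word_list))
# ['Apple', 'Apple-juice', 'ple-ple', 'pleasure']

# A regular expression works as well: words starting with 'ple' or 'Ple'.
print(pygrep_compiled('^[Pp]le', word_list))
# ['ple-ple', 'pleasure', 'Please']
```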

Why `re.search`?

The re module offers several functions that search a string SOURCE for a PATTERN, including re.match, re.search, and re.findall.

All of them search strings, but they behave differently, as summarized below.

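Function	Behavior
`re.match()`	Matches only at the beginning of the string
`re.search()`	Scans the string from the start and returns the first match; intuitively closest to `contains`
`re.findall()`	Scans the string from the start and returns all non-overlapping matches as a `list`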
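
To make the table concrete, here is a small sketch that applies all three functions to one string. The sample string `source` is made up for illustration.

```python
import re

source = 'ple-ple'  # hypothetical sample string

# re.match only succeeds if the pattern matches at position 0.
print(re.match('ple', source).span())    # (0, 3)
print(re.match('-ple', source))          # None

# re.search scans forward and returns the first match it finds.
print(re.search('-ple', source).span())  # (3, 7)

# re.findall returns every non-overlapping match as a list of strings.
print(re.findall('ple', source))         # ['ple', 'ple']
```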

re.match() tries the pattern only at the start of SOURCE and returns None if the pattern does not match there. The following compares the behavior of re.match() and re.search().

import re

## SOURCE
string_with_newlines = """something\nsomeotherthing"""


print(re.match('some', string_with_newlines)) # match
print(re.search('some', string_with_newlines)) # match

print(re.match('thing', string_with_newlines)) # won't match
print(re.match('.{0,}thing', string_with_newlines)) # match
print(re.match('.?thing', string_with_newlines)) # won't match
print(re.match('.*thing', string_with_newlines)) # match

print(re.search('thing', string_with_newlines)) # match
print(re.search('.{0,}thing', string_with_newlines)) # match
print(re.search('.?thing', string_with_newlines)) # match
print(re.search('.*thing', string_with_newlines)) # match

print(re.match('someother', string_with_newlines)) # won't match
print(re.match('.{0,}someother', string_with_newlines)) # won't match
print(re.match('.*someother', string_with_newlines)) # won't match

print(re.search('someother', string_with_newlines)) # match
print(re.search('.{0,}someother', string_with_newlines)) # match
print(re.search('.*someother', string_with_newlines)) # match

REMARKS

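?, *, and {0,} all allow the preceding element to be absent, but they differ in how many repetitions they permit. All three are greedy by default, and . does not match a newline, so none of the matches below extends past the first line.

?: zero or one occurrence
*, {0,}: zero or more occurrences (the two are equivalent)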

string_with_newlines = """something\nsomeotherthing"""

print(re.search('thing', string_with_newlines).group())
>>> thing

print(re.search('.{0,}thing', string_with_newlines).group())
>>> something

print(re.search('.?thing', string_with_newlines).group())
>>> ething

print(re.search('.*thing', string_with_newlines).group())
>>> something
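
For contrast, a non-greedy quantifier (`*?`) and the `re.DOTALL` flag change these results. The following is a small sketch on the same SOURCE string; it is not from the original post, and the variable names `greedy` and `lazy` are chosen only for illustration.

```python
import re

string_with_newlines = "something\nsomeotherthing"

# Greedy: '.*' takes as much as it can while still letting 'thing' match.
greedy = re.search('s.*thing', string_with_newlines, re.DOTALL).group()

# Non-greedy: '.*?' takes as little as it can.
lazy = re.search('s.*?thing', string_with_newlines, re.DOTALL).group()

print(repr(greedy))  # 'something\nsomeotherthing'
print(repr(lazy))    # 'something'

# With re.DOTALL, '.' also matches '\n', so re.match can reach the second line.
crosses_newline = re.match('.*someother', string_with_newlines, re.DOTALL)
print(crosses_newline is not None)  # True
```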

Comparison with the loop method

Regular expressions cannot be used this way, but when PATTERN is a plain substring, a loop can also collect the strings in SOURCE that contain it into a list:

def loop_grep(pattern: str, word_list: list):
    res = []
    for word in word_list:
        if pattern in word:
            res.append(word)
    return res

Run-time comparison

import time
import random
import string
from itertools import compress
import re

def pygrep(pattern: str, word_list: list):
    list_idx = map(lambda x: bool(re.search(pattern, x)), word_list)
    res = list(compress(word_list, list_idx))
    return res

def loop_grep(pattern: str, word_list: list):
    res = []
    for word in word_list:
        if pattern in word:
            res.append(word)
    return res

def generate_word(LENGTH=10):
    word = [random.choice(string.ascii_lowercase) for _ in range(LENGTH)]
    word = ''.join(word)
    return word

mapsearch_execute_time = []
loopsearch_execute_time = []

for list_size in range(1000, 100000, 1000):
    tmp_list_map = 0
    tmp_list_loop = 0
    for j in range(10):
        wordlist = [generate_word() for _ in range(list_size)]

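        # perf_counter() is a monotonic, high-resolution clock meant for timing
        start_loop = time.perf_counter()
        res = loop_grep('python', wordlist)
        tmp_list_loop += time.perf_counter() - start_loop

        start_map = time.perf_counter()
        res = pygrep('python', wordlist)
        tmp_list_map += time.perf_counter() - start_map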

    # average over the 10 trials
    mapsearch_execute_time.append(tmp_list_map/10)
    loopsearch_execute_time.append(tmp_list_loop/10)

The plotting code is below.

from matplotlib import pyplot as plt
import numpy as np

fig, ax = plt.subplots()

x = np.arange(1000, 100000, 1000)

ax.plot(x, mapsearch_execute_time, label='map')
ax.plot(x, loopsearch_execute_time, label='loop')
ax.set_xlabel('list size')
ax.set_ylabel('run-time')

ax.legend()

pygrep turns out to be dramatically slower than the loop version…
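
One way to cross-check this kind of measurement is the standard library's `timeit` module, which repeats each call for you. The sketch below is an illustration under stated assumptions, not the original benchmark: the list size (10,000), the repeat count (20), the pattern `'ab'`, and the helper name `make_words` are arbitrary choices, and absolute timings will vary by machine.

```python
import random
import re
import string
import timeit
from itertools import compress

def pygrep(pattern, word_list):
    flags = map(lambda x: bool(re.search(pattern, x)), word_list)
    return list(compress(word_list, flags))

def loop_grep(pattern, word_list):
    res = []
    for word in word_list:
        if pattern in word:
            res.append(word)
    return res

def make_words(n, length=10):
    # Hypothetical helper: n random lowercase words of the given length.
    return [''.join(random.choices(string.ascii_lowercase, k=length))
            for _ in range(n)]

words = make_words(10_000)

# For a plain literal pattern, both functions should return the same list.
same_result = pygrep('ab', words) == loop_grep('ab', words)
print(same_result)

# Total time for 20 calls of each function on the same list.
t_map = timeit.timeit(lambda: pygrep('ab', words), number=20)
t_loop = timeit.timeit(lambda: loop_grep('ab', words), number=20)
print(f'pygrep: {t_map:.4f}s  loop_grep: {t_loop:.4f}s')
```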

References

Stack Overflow: What is the difference between re.search and re.match?

Pythonista Tips 3/N
