Unicode Regex with regex not working in Python
I have the following regex (see it in action in PCRE):
    .*?\P{L}*?(\p{L}+-?(\p{L}+)?)\P{L}*$
However, Python doesn't support Unicode regex with the \p{} syntax. To solve this, I read that I could use the regex module (not the default re), but that doesn't seem to work either, not even with the u flag.
Example:
    sentence = "valt nog zoveel zal kunnen zeggen, "
    print(re.sub(".*?\P{L}*?(\p{L}+-?(\p{L}+)?)\P{L}*$","\1",sentence))
- Output: < blank >
- Expected output: zeggen
This doesn't work with Python 3.4.3.
As you can see, Unicode character classes like \p{L} are not available in the re module. However, that doesn't mean you can't do it with the re module, since \p{L} can be replaced with [^\W\d_] together with the UNICODE flag (even if there are small differences between these two character classes, see the link in the comments).
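As a rough sketch of that substitution, the pattern from the question can be rewritten for the re module with [^\W\d_] standing in for \p{L} and [\W\d_] standing in for \P{L}. This assumes Python 3, where str patterns are already Unicode-aware (so re.UNICODE is redundant but harmless); the names LETTER, NON_LETTER and result are only illustrative, and the inner group is made non-capturing for brevity.

```python
import re

LETTER = r'[^\W\d_]'     # approximates \p{L}: word characters minus digits and underscore
NON_LETTER = r'[\W\d_]'  # approximates \P{L}: the complement of LETTER

pattern = re.compile(
    r'.*?' + NON_LETTER + r'*?'
    r'(' + LETTER + r'+-?(?:' + LETTER + r'+)?)'
    + NON_LETTER + r'*$',
    re.UNICODE,
)

sentence = "valt nog zoveel zal kunnen zeggen, "

# The replacement is a raw string so that \1 reaches re as a backreference
# rather than as the control character '\x01'.
result = pattern.sub(r'\1', sentence)
print(result)  # zeggen
```

This only shows that the character-class swap works; it keeps the replace-everything approach of the question.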
Second point: your approach is not the right one (if I understand correctly, you are trying to extract the last word of each line), because you have strangely decided to remove everything that is not the last word (except the newline) with a replacement. ~52000 steps to extract 10 words from 10 lines of text is not acceptable (and it will crash with more characters). A more efficient way is to find the last words directly, see this example:
    import re

    s = '''ik heb nog nooit een kat gezien zo lélijk!
    het een minder lelijk dan uw hond.'''

    # For each line (re.M), capture the last word, allowing hyphenated compounds.
    p = re.compile(r'^.*\b(?<!-)(\w+(?:-\w+)*)', re.M | re.U)
    words = p.findall(s)
    print('\n'.join(words))
Notices:
- To obtain the same result with Python 2.7, you only need to add a u before the opening quotes of the string: s = u'''...
- If you absolutely want to limit the results to letters, avoiding digits and underscores, replace \w with [^\W\d_] in the pattern.
- If you use the regex module, maybe the character class \p{IsLatin} will be more appropriate for your use or, whatever module you choose, a more explicit class with only the needed characters, like: [A-Za-záéóú...
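To illustrate the second notice, here is a small comparison of the \w version and the letters-only version of the last-word pattern. It is only a sketch: the sample text and the variable names are made up for the illustration, and Python 3 is assumed.

```python
import re

# Hypothetical sample: the first line ends in a number, the second in a hyphenated word.
text = "kamer 101\ndag-en-nacht"

# Version from the answer: \w also accepts digits and underscores.
with_w = re.compile(r'^.*\b(?<!-)(\w+(?:-\w+)*)', re.M | re.U)

# Letters-only variant: every \w is replaced with [^\W\d_].
letters_only = re.compile(r'^.*\b(?<!-)([^\W\d_]+(?:-[^\W\d_]+)*)', re.M | re.U)

last_any = with_w.findall(text)
last_letters = letters_only.findall(text)

print(last_any)      # ['101', 'dag-en-nacht']
print(last_letters)  # ['kamer', 'dag-en-nacht']
```

With the letters-only class, a trailing number is skipped and the last alphabetic word of the line is returned instead.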
You can achieve the same with the regex module using this pattern:
    p = regex.compile(r'^.*\m(?<!-)(\pL+(?:-\pL+)*)', regex.M | regex.U)
Other ways:
By line, with the re module:
    p = re.compile(r'[^\w-]+', re.U)
    for line in s.split('\n'):
        print(p.split(line+' ')[-2])
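Why the appended space and the index -2? A short sketch of the intermediate result may help (Python 3 assumed; the variable names are illustrative). Appending a space guarantees that the line ends with a separator, so the split always yields an empty string as its final element and the last real word sits just before it.

```python
import re

p = re.compile(r'[^\w-]+', re.U)

line = "ik heb nog nooit een kat gezien zo lélijk!"

# "! " at the end is a single separator run, which leaves a trailing empty string.
parts = p.split(line + ' ')
print(parts)      # ['ik', 'heb', 'nog', 'nooit', 'een', 'kat', 'gezien', 'zo', 'lélijk', '']
print(parts[-2])  # lélijk

# The same holds for a line without trailing punctuation.
plain = p.split("uw hond" + ' ')
print(plain)      # ['uw', 'hond', '']
```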
With the regex module you can take advantage of the reversed search:
    p = regex.compile(r'(?r)\w+(?:-\w+)*\M', regex.U)
    for line in s.split('\n'):
        print(p.search(line).group(0))