Unicode Regex with regex not working in Python

- June 15, 2013

I have the following regex (see it in action in PCRE):

.*?\P{L}*?(\p{L}+-?(\p{L}+)?)\P{L}*$

However, Python doesn't support the Unicode regex \p{} syntax. To solve this, I read that I could use the regex module (not the default re), but that doesn't seem to work either, not even with the u flag.

Example:

sentence = "valt nog zoveel zal kunnen zeggen, "
print(re.sub(".*?\P{L}*?(\p{L}+-?(\p{L}+)?)\P{L}*$","\1",sentence))

Output: < blank >
Expected output: zeggen

This doesn't work with Python 3.4.3.

As you can see, Unicode character classes like \p{L} are not available in the re module. However, that doesn't mean you can't do it with the re module, since \p{L} can be replaced with [^\W\d_] together with the UNICODE flag (even if there are small differences between these two character classes; see the link in the comments).
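To illustrate the replacement of \p{L} described above, here is a minimal sketch that uses only the standard re module. It assumes Python 3.6 or later, where str patterns are already Unicode-aware (so re.U is redundant but harmless). The names LETTER and NON_LETTER and the first sample string are made up for the illustration. It also assumes raw strings for both the pattern and the replacement, because "\1" in an ordinary string literal is the control character chr(1) rather than a backreference.

```python
import re

# Hypothetical helper names; [^\W\d_] stands in for \p{L}, [\W\d_] for \P{L}.
LETTER = r'[^\W\d_]'
NON_LETTER = r'[\W\d_]'

# Letters only: the digits and the underscore are skipped.
letters = re.findall(LETTER + '+', 'lélijk 123 foo_bar', re.U)
print(letters)

# The pattern from the question, rewritten with the two classes above.
pattern = rf'.*?{NON_LETTER}*?({LETTER}+-?(?:{LETTER}+)?){NON_LETTER}*$'
sentence = "valt nog zoveel zal kunnen zeggen, "

# Raw string for the replacement, so \1 is a backreference and not chr(1).
result = re.sub(pattern, r'\1', sentence, flags=re.U)
print(result)
```

This only shows that the character class works with re; it keeps the substitution approach from the question.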

Second point: your approach is not the right one (if I understand correctly, you are trying to extract the last word of each line), because you have strangely decided to remove everything that is not the last word (except the newline) with a replacement. ~52000 steps to extract 10 words from 10 lines of text is not acceptable (and it will crash with more characters). A more efficient way is to find the last words directly; see this example:

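import re

s = '''ik heb nog nooit een kat gezien zo lélijk!
het een minder lelijk dan uw hond.'''

# ^.* runs to the end of each line, then backtracks to the last word start
# that is not preceded by a hyphen; hyphenated words are kept whole.
p = re.compile(r'^.*\b(?<!-)(\w+(?:-\w+)*)', re.M | re.U)

words = p.findall(s)

print('\n'.join(words))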

Notes:

To obtain the same result with Python 2.7, you only need to add a u before the opening quotes of the string: s = u'''...
If you absolutely want to limit the results to letters, avoiding digits and underscores, replace \w with [^\W\d_] in the pattern.
If you use the regex module, maybe the character class \p{IsLatin} will be more appropriate for your use; or, whichever module you choose, a more explicit class with only the needed characters, something like: [A-Za-záéóú...
You can achieve the same with the regex module using this pattern:
p = regex.compile(r'^.*\m(?<!-)(\pL+(?:-\pL+)*)', regex.M | regex.U)
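As a rough illustration of the second note, the sketch below compares the \w version of the pattern with a letters-only variant built on [^\W\d_]. The sample text and the names word_pattern and letter_pattern are invented for this example, and only the standard re module is assumed.

```python
import re

# Invented sample text: the last line ends in a number rather than a word.
text = '''kamer 12 kost 25 euro
een zwart-wit foto-album
totaal 2013'''

# Pattern from the answer: \w also accepts digits and underscores.
word_pattern = re.compile(r'^.*\b(?<!-)(\w+(?:-\w+)*)', re.M | re.U)

# Same pattern with \w replaced by [^\W\d_], so only letters can match.
letter_pattern = re.compile(
    r'^.*\b(?<!-)([^\W\d_]+(?:-[^\W\d_]+)*)', re.M | re.U)

with_digits = word_pattern.findall(text)
letters_only = letter_pattern.findall(text)

print(with_digits)
print(letters_only)
```

On the last line the two variants differ: the \w version returns the number, while the letters-only version falls back to the preceding word.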

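Other ways:

By line with the re module:

# Split on runs of characters that are neither word characters nor hyphens.
# The appended space guarantees a trailing empty item, so the last word is at index -2.
p = re.compile(r'[^\w-]+', re.U)
for line in s.split('\n'):
    print(p.split(line+' ')[-2])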

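With the regex module you can take advantage of the reversed search:

import regex

# (?r) searches from the end of the string; \M is the end-of-word anchor.
p = regex.compile(r'(?r)\w+(?:-\w+)*\M', regex.U)
for line in s.split('\n'):
    print(p.search(line).group(0))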
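The regex module is a third-party package. Where it is not available, a similar last-word lookup can be sketched with re alone by anchoring the match at the end of the line. This is a swapped-in technique, not the reversed search itself, and the names LAST_WORD and last_word are hypothetical.

```python
import re

# A hyphen-aware word followed only by non-word characters up to the end.
LAST_WORD = re.compile(r'(\w+(?:-\w+)*)\W*$', re.U)

def last_word(line):
    """Return the last word of line, or None if the line contains no word."""
    match = LAST_WORD.search(line)
    return match.group(1) if match else None

s = '''ik heb nog nooit een kat gezien zo lélijk!
het een minder lelijk dan uw hond.'''

for line in s.split('\n'):
    print(last_word(line))
```

Unlike the reversed search, this scans each line from the left and retries at every word, so it is probably slower on very long lines.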
