Recursive group capturing regex with backreference in Java
I am trying to capture multiple groups recursively in a string, using a backreference to a group within the regex. Even though I am using a Pattern, a Matcher, and a "while (matcher.find())" loop, it is still capturing only the last instance instead of all instances. In my case the only possible tags are <sm>, <po>, <pof>, <pos>, <poi>, <pol>, <poif>, <poil>. Since these are formatting tags, I need to capture:
- any text outside of a tag (so that I can format it as "normal" text; I am going about this by capturing any text before the tag in one group while I capture the tag itself in another group, and as I iterate through the occurrences I remove what has been captured from the original string; if I have any text left over at the end, I format it as "normal" text)
- the "name" of the tag, so that I know how I will have to format the text inside the tag
- the text contents of the tag, so that they can be formatted according to the tag name and its associated rules
Here is my sample code:
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    // m_xText and xTextRange are defined elsewhere in the program.
    String currentText = "the man said:<pof>“this one, at last, bone of bones</pof><poi>and flesh of flesh;</poi><po>this one shall called ‘woman,’</po><poil>for out of man one has been taken.”</poil>";
    String remainingText = currentText;
    // First check whether our string has any kind of XML tag, because if not
    // we will format the whole string as "normal" text.
    if (currentText.matches("(?su).*<[/]{0,1}(?:sm|po)[f|l|s|i|3]{0,1}[f|l]{0,1}>.*")) {
        // An opening or closing tag has been found, so let us start our pattern captures.
        // I am using the backreference \\2 to make sure the closing tag is the same as the opening tag.
        Pattern pattern1 = Pattern.compile("(.*)<((sm|po)[f|l|s|i|3]{0,1}[f|l]{0,1})>(.*?)</\\2>", Pattern.UNICODE_CHARACTER_CLASS);
        Matcher matcher1 = pattern1.matcher(currentText);
        int iteration = 0;
        while (matcher1.find()) {
            System.out.print("iteration ");
            System.out.println(++iteration);
            System.out.println("group1:" + matcher1.group(1));
            System.out.println("group2:" + matcher1.group(2));
            System.out.println("group3:" + matcher1.group(3));
            System.out.println("group4:" + matcher1.group(4));
            // Text before the tag is inserted as "normal" text.
            if (matcher1.group(1) != null && !matcher1.group(1).isEmpty()) {
                m_xText.insertString(xTextRange, matcher1.group(1), false);
                remainingText = remainingText.replaceFirst(matcher1.group(1), "");
            }
            // Text inside the tag is formatted according to the tag name.
            if (matcher1.group(4) != null && !matcher1.group(4).isEmpty()) {
                switch (matcher1.group(2)) {
                    case "pof": [...]
                    case "pos": [...]
                    case "poif": [...]
                    case "po": [...]
                    case "poi": [...]
                    case "pol": [...]
                    case "poil": [...]
                    case "sm": [...]
                }
                remainingText = remainingText.replaceFirst("<" + matcher1.group(2) + ">" + matcher1.group(4) + "</" + matcher1.group(2) + ">", "");
            }
        }
    }
The System.out.println calls are producing output only once in the console, with these results:
    iteration 1
    group1:the man said:<pof>“this one, at last, bone of bones</pof><poi>and flesh of flesh;</poi><po>this one shall called ‘woman,’</po>
    group2:poil
    group3:po
    group4:for out of man one has been taken.”
Group 3 can be ignored; the only useful groups are 1, 2, and 4 (group 3 is part of group 2). So why is it capturing only the last tag instance, "poil", while not capturing the preceding "pof", "poi", and "po" tags?
The output I would like to see is this:
    iteration 1
    group1:the man said:
    group2:pof
    group3:po
    group4:“this one, at last, bone of bones
    iteration 2
    group1:
    group2:poi
    group3:po
    group4:and flesh of flesh;
    iteration 3
    group1:
    group2:po
    group3:po
    group4:this one shall called ‘woman,’
    iteration 4
    group1:
    group2:poil
    group3:po
    group4:for out of man one has been taken.”
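I found the answer to my problem: I needed a non-greedy quantifier in the first capture group, just as I had in the fourth capture group. A greedy (.*) first consumes the whole string and then backtracks only as far as it must, so it settles on the last tag and swallows every earlier tag into group 1. This is working as needed: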
    Pattern pattern1 = Pattern.compile("(.*?)<((sm|po)[f|l|s|i|3]{0,1}[f|l]{0,1})>(.*?)</\\2>", Pattern.UNICODE_CHARACTER_CLASS);
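To see the difference between the two quantifiers side by side, here is a minimal, self-contained sketch. The class name TagSegments, the helpers countMatches and split, and the "normal" label for untagged text are all hypothetical names chosen for illustration; they are not part of the original program. Instead of removing captured text from a copy of the string, this sketch assumes it is acceptable to track the end of the last match with Matcher.end() and treat whatever follows as leftover "normal" text.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TagSegments {
    // Original pattern: greedy (.*) in group 1.
    static final Pattern GREEDY = Pattern.compile(
        "(.*)<((sm|po)[f|l|s|i|3]{0,1}[f|l]{0,1})>(.*?)</\\2>",
        Pattern.UNICODE_CHARACTER_CLASS);

    // Fixed pattern: reluctant (.*?) in group 1; nothing else differs.
    static final Pattern RELUCTANT = Pattern.compile(
        "(.*?)<((sm|po)[f|l|s|i|3]{0,1}[f|l]{0,1})>(.*?)</\\2>",
        Pattern.UNICODE_CHARACTER_CLASS);

    /** Counts how many times the pattern is found in the text. */
    static int countMatches(Pattern pattern, String text) {
        Matcher matcher = pattern.matcher(text);
        int count = 0;
        while (matcher.find()) {
            count++;
        }
        return count;
    }

    /**
     * Splits the text into {tagName, content} pairs, in document order.
     * Untagged text is labelled "normal" (a hypothetical label).
     */
    static List<String[]> split(String text) {
        List<String[]> segments = new ArrayList<>();
        Matcher matcher = RELUCTANT.matcher(text);
        int end = 0; // index just past the last match
        while (matcher.find()) {
            if (!matcher.group(1).isEmpty()) {
                segments.add(new String[] {"normal", matcher.group(1)});
            }
            segments.add(new String[] {matcher.group(2), matcher.group(4)});
            end = matcher.end();
        }
        // Anything after the last closing tag is leftover "normal" text.
        if (end < text.length()) {
            segments.add(new String[] {"normal", text.substring(end)});
        }
        return segments;
    }

    public static void main(String[] args) {
        String text = "the man said:<pof>“this one, at last, bone of bones</pof>"
            + "<poi>and flesh of flesh;</poi><po>this one shall called ‘woman,’</po>"
            + "<poil>for out of man one has been taken.”</poil>";
        System.out.println("greedy matches: " + countMatches(GREEDY, text));
        System.out.println("reluctant matches: " + countMatches(RELUCTANT, text));
        for (String[] segment : split(text)) {
            System.out.println(segment[0] + " -> " + segment[1]);
        }
    }
}
```

On the sample string, the greedy pattern should be found once and the reluctant pattern four times. As a side note, inside a character class the | is a literal character, so [f|l|s|i|3] also accepts a pipe; [flsi3]? would be a tighter way to write it, but the sketch keeps the pattern as posted.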