Regex in Python dilemma

February 15, 2010

I was having issues with previous parts of this code and fixed a few things, but I can't figure out how to retrieve the data and use the regex correctly. I'm trying to retrieve the full address from the BBB link in the url variable. How can I pull the company name and address efficiently?

Here is the code:

import urllib2
import re

print "Enter an industry keyword."
print "Example: florists, construction, tiles"

keyword = raw_input('> ')

print "How many pages to dig through on BBB?"
total_pages = int(raw_input('> '))

print "Working..."

page_number = 1
address_list = []

address_pattern = r'<address>(.*?)<\/address>'  # here is the issue

for page_number in range(1, total_pages):

    url = 'https://www.bbb.org/search/?type=category&input=' + keyword + '&filter=business&page=' + str(page_number)
    req = urllib2.Request(url)
    req.add_header('User-Agent', 'Mozilla/5.0')
    resp = urllib2.urlopen(req)
    respdata = resp.read()

    business_address = re.findall(address_pattern, str(respdata))
    address_list.extend(business_address)

for each in address_list:
    print each

print "\n Save to a text file? Hit enter if so.\n"
raw_input('>')

file = open('export.txt', 'w')

for each in address_list:
    file.write('%r \n' % each)

file.close()

print 'File saved!'

Your pattern must be:

address_pattern = r'<address>(.*?)<\/address>'

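not *.?. The group (.*?) is a non-greedy match, and the string that exists between the tags is captured in group index 1. Unfortunately, this won't match if there is a newline character present in the text between the tags, because by default the dot does not match a newline. I suggest you enable the DOTALL modifier: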

address_pattern = r'(?s)<address>(.*?)<\/address>'
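To see the difference, here is a minimal sketch that runs both patterns over a made-up snippet. The SAMPLE_HTML string and the find_addresses helper are assumptions for illustration only; the real BBB markup may look different. It also includes a greedy variant to show why the ? after .* matters.

```python
import re

# Hypothetical sample markup (an assumption, not the real BBB page).
# The first address contains a newline, the second does not.
SAMPLE_HTML = (
    "<address>123 Main St\nSpringfield, IL 62701</address>"
    "<p>other markup</p>"
    "<address>9 Elm Ave, Dayton, OH 45402</address>"
)

WITHOUT_DOTALL = r'<address>(.*?)<\/address>'
WITH_DOTALL = r'(?s)<address>(.*?)<\/address>'
GREEDY_DOTALL = r'(?s)<address>(.*)<\/address>'


def find_addresses(pattern, html, flags=0):
    """Return the group 1 text of every match (hypothetical helper)."""
    return re.findall(pattern, html, flags)


# Without DOTALL the dot stops at the newline, so the first address is skipped.
no_dotall = find_addresses(WITHOUT_DOTALL, SAMPLE_HTML)

# With (?s) the dot also matches newlines, so both addresses are found.
with_dotall = find_addresses(WITH_DOTALL, SAMPLE_HTML)

# Passing re.DOTALL as a flag is equivalent to the inline (?s).
with_flag = find_addresses(WITHOUT_DOTALL, SAMPLE_HTML, re.DOTALL)

# A greedy .* runs from the first <address> to the last </address>.
greedy = find_addresses(GREEDY_DOTALL, SAMPLE_HTML)
```

If the page uses attributes on the tag (for example a class), the literal <address> in the pattern would need adjusting; for anything beyond a quick script, an HTML parser is usually more robust than a regex.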
