Regex in Python dilemma
I was having issues with previous parts of the code and fixed a few things, but I can't figure out how to retrieve the data or use the regex correctly. I'm trying to retrieve the full address from the BBB link in the url variable. How can I pull the company name and address efficiently?
Here is the code:
import urllib2
import re

print "enter industry keyword."
print "example: florists, construction, tiles"
keyword = raw_input('> ')

print "how many pages to dig through bbb?"
total_pages = int(raw_input('> '))

print "working..."

page_number = 1
address_list = []
address_pattern = r'<address>(.*?)<\/address>'  # here is the issue

# range() excludes its stop value, so add 1 to include the last page
for page_number in range(1, total_pages + 1):
    url = 'https://www.bbb.org/search/?type=category&input=' + keyword + '&filter=business&page=' + str(page_number)
    req = urllib2.Request(url)
    req.add_header('user-agent', 'mozilla/5.0')
    resp = urllib2.urlopen(req)
    respdata = resp.read()
    # collect every <address>...</address> body found on this page
    business_address = re.findall(address_pattern, str(respdata))
    address_list.extend(business_address)

for each in address_list:
    print each

print "\n save to text file? hit enter if so.\n"
raw_input('>')

file = open('export.txt', 'w')
for each in address_list:
    file.write('%r \n' % each)
file.close()

print 'file saved!'
Your pattern must be:
address_pattern = r'<address>(.*?)<\/address>'
not *.?
(.*?) performs a non-greedy match, and the string between the tags is captured in group index 1. Unfortunately, this won't match if a newline character is present in the text, so I suggest enabling the DOTALL modifier:
address_pattern = r'(?s)<address>(.*?)<\/address>'
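To illustrate the difference, here is a minimal sketch run against a made-up HTML fragment. The sample_html string and the addresses in it are assumptions for demonstration only; the real BBB markup may look different. It compares the pattern with and without (?s), shows what a greedy (.*) would capture instead, and collapses the whitespace left over from multi-line addresses.

```python
import re

# Hypothetical HTML fragment (an assumption, not real BBB output).
# The first address spans two lines; the second fits on one line.
sample_html = (
    "<address>12 Main St,\n"
    "Springfield, IL 62701</address>\n"
    "<p>unrelated</p>\n"
    "<address>99 Oak Ave, Austin, TX 73301</address>"
)

# Without DOTALL, "." cannot cross the newline, so the first address is missed.
without_dotall = re.findall(r'<address>(.*?)</address>', sample_html)

# With (?s), "." also matches newlines, so both addresses are captured.
with_dotall = re.findall(r'(?s)<address>(.*?)</address>', sample_html)

# A greedy (.*) runs from the first <address> to the last </address>,
# swallowing everything in between as a single match.
greedy = re.findall(r'(?s)<address>(.*)</address>', sample_html)

# Collapse newlines and repeated spaces inside each captured address.
cleaned = [" ".join(address.split()) for address in with_dotall]

print(without_dotall)
print(cleaned)
print(len(greedy))
```

Passing re.DOTALL as the flags argument, as in re.findall(pattern, text, re.DOTALL), has the same effect as putting (?s) at the start of the pattern.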