regex - Python urllib2 request: blinking cursor, no response
I'm trying to extract data from BBB, but I get no response. I don't get any error messages, just a blinking cursor. Is it a regex issue? Also, if you see anything I can improve on in terms of efficiency or coding style, I'm open to advice!
Here is the code:
    import urllib2
    import re

    print "enter industry keyword."
    print "example: florists, construction, tiles"
    keyword = raw_input('> ')
    print "how many pages dig through bbb?"
    total_pages = raw_input('> ')
    print "working..."
    page_number = 1
    address_list = []
    url = 'https://www.bbb.org/search/?type=category&input=' + keyword + '&filter=business&page=' + str(page_number)
    req = urllib2.Request(url)
    req.add_header('User-Agent', 'Mozilla/5.0')
    resp = urllib2.urlopen(req)
    respdata = resp.read()
    address_pattern = r'<address>(.*?)<\/address>'
    while page_number <= total_pages:
        business_address = re.findall(address_pattern, str(respdata))
        for each in business_address:
            address_list.append(each)
        page_number += 1
    for each in address_list:
        print each
    print "\n save text file? hit enter if so.\n"
    raw_input('>')
    file = open('export.txt', 'w')
    for each in address_list:
        file.write('%r \n' % each)
    file.close()
    print 'file saved!'
Edit: I updated the code, but I still don't get any results:
    import urllib2
    import re

    print "enter industry keyword."
    print "example: florists, construction, tiles"
    keyword = raw_input('> ')
    print "how many pages dig through bbb?"
    total_pages = int(raw_input('> '))
    print "working..."
    page_number = 1
    address_list = []
    for page_number in range(1, total_pages):
        url = 'https://www.bbb.org/search/?type=category&input=' + keyword + '&filter=business&page=' + str(page_number)
        req = urllib2.Request(url)
        req.add_header('User-Agent', 'Mozilla/5.0')
        resp = urllib2.urlopen(req)
        respdata = resp.read()
        address_pattern = r'<address>(.*?)<\/address>'
        business_address = re.findall(address_pattern, respdata)
        address_list.extend(business_address)
    for each in address_list:
        print each
    print "\n save text file? hit enter if so.\n"
    raw_input('>')
    file = open('export.txt', 'w')
    for each in address_list:
        file.write('%r \n' % each)
    file.close()
    print 'file saved!'
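Convert total_pages to an integer using int, and use range instead of the while loop. raw_input returns a string, and in Python 2 an int always compares as less than a str, so page_number <= total_pages never becomes false and the loop runs forever: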
    total_pages = int(raw_input('> '))
    ...............
    for page_number in range(2, total_pages+1):
That will fix your issue, but the loop is redundant: you use the same respdata and address_pattern on every iteration, so you keep adding the same thing repeatedly. If you want to crawl multiple pages, you need to move the urllib code inside the loop so that it fetches each page_number:
    # total_pages + 1 because range excludes its upper bound
    for page_number in range(1, total_pages + 1):
        url = 'https://www.bbb.org/search/?type=category&input=' + keyword + '&filter=business&page=' + str(page_number)
        req = urllib2.Request(url)
        req.add_header('User-Agent', 'Mozilla/5.0')
        resp = urllib2.urlopen(req)
        respdata = resp.read()
        business_address = re.findall(address_pattern, respdata)
        # use extend to add the data from findall
        address_list.extend(business_address)
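To make the "fetch inside the loop" idea easy to check without hitting the network, here is a minimal sketch that takes the download step as a function argument. The names build_url, collect_addresses, and fake_fetch are made up for illustration, and the stand-in HTML is an assumption; whether the real BBB pages wrap addresses in <address> tags is taken from the question, not verified.

```python
import re

# DOTALL lets '.' match newlines, in case an address spans several lines.
ADDRESS_PATTERN = re.compile(r'<address>(.*?)</address>', re.DOTALL)


def build_url(keyword, page_number):
    """Build the search URL for one results page (URL format from the question)."""
    return ('https://www.bbb.org/search/?type=category&input=' + keyword +
            '&filter=business&page=' + str(page_number))


def collect_addresses(fetch, keyword, total_pages):
    """Fetch every page from 1 to total_pages and gather all address matches.

    fetch is any callable that takes a URL and returns the page HTML as text.
    """
    address_list = []
    for page_number in range(1, total_pages + 1):
        html = fetch(build_url(keyword, page_number))  # one request per page
        address_list.extend(ADDRESS_PATTERN.findall(html))
    return address_list


def fake_fetch(url):
    """Hypothetical stand-in for the real download: one address per page."""
    page = url.rsplit('=', 1)[1]  # text after the last '=' is the page number
    return '<address>%s Main St</address>' % page


print(collect_addresses(fake_fetch, 'florists', 3))
```

Swapping fake_fetch for a function that wraps urllib2.urlopen (or requests.get) gives the real crawler, while the loop logic stays testable on its own.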
respdata is already a string, so you don't need to call str on it. Using requests, you can simplify the code further:
    import requests

    # total_pages + 1 because range excludes its upper bound
    for page_number in range(1, total_pages + 1):
        url = 'https://www.bbb.org/search/?type=category&input=' + keyword + '&filter=business&page=' + str(page_number)
        respdata = requests.get(url).content
        business_address = re.findall(address_pattern, respdata)
        address_list.extend(business_address)
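A slightly more defensive variant of the requests approach might look like the sketch below. It passes the query string through params so that keywords with spaces are URL-encoded, keeps the User-Agent header from the question, adds a timeout so a stalled connection does not hang silently, and raises on HTTP errors. The function name fetch_addresses and its get argument are hypothetical; get exists only so the function can be exercised without a network connection. Whether the site actually serves the addresses in static HTML is an assumption carried over from the question.

```python
import re

# DOTALL lets '.' match newlines, in case an address spans several lines.
ADDRESS_PATTERN = re.compile(r'<address>(.*?)</address>', re.DOTALL)
SEARCH_URL = 'https://www.bbb.org/search/'


def fetch_addresses(keyword, page_number, get=None):
    """Return the <address> contents found on one results page.

    get is a hypothetical hook for testing; it defaults to requests.get.
    """
    if get is None:
        import requests  # third-party package, imported only when needed
        get = requests.get
    params = {
        'type': 'category',
        'input': keyword,      # requests URL-encodes spaces and symbols
        'filter': 'business',
        'page': page_number,
    }
    resp = get(SEARCH_URL, params=params,
               headers={'User-Agent': 'Mozilla/5.0'}, timeout=10)
    resp.raise_for_status()    # fail loudly on 4xx/5xx instead of parsing an error page
    return ADDRESS_PATTERN.findall(resp.text)
```

Calling fetch_addresses(keyword, page_number) inside the range loop and extending address_list with the result would replace the URL concatenation shown above.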