regex - Python urllib2 request blinking cursor no response


I'm trying to extract data from BBB, but I get no response. I don't get any error messages, just a blinking cursor. Is it a regex issue? Also, if you see anything I can improve in terms of efficiency or coding style, I'm open to advice!

Here is my code:

    import urllib2
    import re

    print "Enter an industry keyword."
    print "Example: florists, construction, tiles"

    keyword = raw_input('> ')

    print "How many pages to dig through on BBB?"
    total_pages = raw_input('> ')

    print "Working..."

    page_number = 1
    address_list = []

    url = 'https://www.bbb.org/search/?type=category&input=' + keyword + '&filter=business&page=' + str(page_number)

    req = urllib2.Request(url)
    req.add_header('User-Agent', 'Mozilla/5.0')
    resp = urllib2.urlopen(req)
    respdata = resp.read()

    address_pattern = r'<address>(.*?)<\/address>'

    while page_number <= total_pages:

        business_address = re.findall(address_pattern,str(respdata))

        for each in business_address:
            address_list.append(each)

        page_number += 1

    for each in address_list:
        print each

    print "\n Save to a text file? Hit Enter if so.\n"
    raw_input('>')

    file = open('export.txt','w')

    for each in address_list:
        file.write('%r \n' % each)

    file.close()

    print 'File saved!'

Edit: I've updated the code, but I still don't get any results:

    import urllib2
    import re

    print "Enter an industry keyword."
    print "Example: florists, construction, tiles"

    keyword = raw_input('> ')

    print "How many pages to dig through on BBB?"
    total_pages = int(raw_input('> '))

    print "Working..."

    page_number = 1
    address_list = []

    for page_number in range(1,total_pages):

        url = 'https://www.bbb.org/search/?type=category&input=' + keyword + '&filter=business&page=' + str(page_number)

        req = urllib2.Request(url)
        req.add_header('User-Agent', 'Mozilla/5.0')
        resp = urllib2.urlopen(req)
        respdata = resp.read()

        address_pattern = r'<address>(.*?)<\/address>'

        business_address = re.findall(address_pattern,respdata)

        address_list.extend(business_address)

    for each in address_list:
        print each

    print "\n Save to a text file? Hit Enter if so.\n"
    raw_input('>')

    file = open('export.txt','w')

    for each in address_list:
        file.write('%r \n' % each)

    file.close()

    print 'File saved!'

Convert total_pages to an integer using int, and use range instead of the while loop:

    total_pages = int(raw_input('> '))
    ...............
    for page_number in range(2, total_pages+1):
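The blinking cursor comes from the comparison in the while loop: raw_input returns a string, and in CPython 2 comparing an int with a str does not raise an error. Numbers always sort before strings there, so page_number <= total_pages stays true forever. A minimal sketch of the conversion follows; page_numbers is a hypothetical helper name that does not appear in the original script.

```python
def page_numbers(raw_total):
    """Return the 1-based page numbers to crawl for a user-supplied count.

    raw_total is the text typed by the user (what raw_input returns).
    """
    total_pages = int(raw_total)  # raises ValueError for non-numeric input
    # range excludes its stop value, so add 1 to include the last page
    return list(range(1, total_pages + 1))


print(page_numbers("3"))
```

Because int raises ValueError on input such as "three", a fuller script would probably want to catch that and ask again.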

That fixes the hang, but the loop is still redundant: it uses the same respdata and address_pattern on every iteration, so you keep adding the same matches repeatedly. If you want to crawl multiple pages, you need to move the urllib code inside the loop so that each page_number is actually fetched:

    # total_pages + 1 so that the last requested page is included
    for page_number in range(1, total_pages + 1):
        url = 'https://www.bbb.org/search/?type=category&input=' + keyword + '&filter=business&page=' + str(page_number)
        req = urllib2.Request(url)
        req.add_header('User-Agent', 'Mozilla/5.0')
        resp = urllib2.urlopen(req)
        respdata = resp.read()

        business_address = re.findall(address_pattern, respdata)
        # use extend to add all the matches returned by findall
        address_list.extend(business_address)
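Since the keyword comes straight from user input, a value containing spaces or an ampersand would produce a malformed query string when it is concatenated into the URL. One possible way to handle this, not something the original code does, is to let urlencode build the query. In this sketch, build_search_url and BASE_URL are hypothetical names; the parameter names are the ones from the URL in the question.

```python
try:
    from urllib import urlencode  # Python 2
except ImportError:
    from urllib.parse import urlencode  # Python 3

BASE_URL = 'https://www.bbb.org/search/'


def build_search_url(keyword, page_number):
    """Build the search URL for one results page, escaping the keyword."""
    # A list of pairs keeps the parameters in a fixed order
    params = [
        ('type', 'category'),
        ('input', keyword),
        ('filter', 'business'),
        ('page', page_number),
    ]
    return BASE_URL + '?' + urlencode(params)


print(build_search_url('tile setters', 2))
```

Inside the loop, url = build_search_url(keyword, page_number) would then replace the string concatenation.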

respdata is already a string, so you don't need to call str on it. Using requests can simplify the code further:

    import requests

    # total_pages + 1 so that the last requested page is included
    for page_number in range(1, total_pages + 1):
        url = 'https://www.bbb.org/search/?type=category&input=' + keyword + '&filter=business&page=' + str(page_number)
        respdata = requests.get(url).content
        # respdata is already a string here, so no str() call is needed
        business_address = re.findall(address_pattern, respdata)
        address_list.extend(business_address)
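If the pages are being fetched but address_list still comes back empty, the regex is worth checking on its own. By default "." does not match a newline, so an address block that spans several lines would be skipped by the pattern in the question. The sketch below uses made-up markup, since the real page source is not shown here and may differ; extract_addresses and ADDRESS_RE are hypothetical names.

```python
import re

# re.DOTALL lets "." match newlines, so multi-line <address> blocks are found
ADDRESS_RE = re.compile(r'<address>(.*?)</address>', re.DOTALL)


def extract_addresses(html):
    """Return the stripped text found inside each <address> element."""
    return [match.strip() for match in ADDRESS_RE.findall(html)]


# Made-up markup for illustration only; the real page may look different.
sample = """
<div><address>1 Main St, Springfield</address></div>
<div><address>
  22 Oak Ave,
  Shelbyville
</address></div>
"""

print(extract_addresses(sample))
```

Without re.DOTALL, the same pattern finds only the first, single-line address in this sample. If the addresses are not in the fetched HTML at all, no regex will help, and printing a slice of respdata is a quick way to check that.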
