regex - Python urllib2 request: blinking cursor, no response

- August 15, 2015

I'm trying to extract data from BBB but I get no response. I don't get any error messages, just a blinking cursor. Is it a regex issue? Also, if you see anything I can improve on in terms of efficiency or coding style, I'm open to advice!

Here is the code:

    import urllib2
    import re

    print "Enter an industry keyword."
    print "Example: florists, construction, tiles"

    keyword = raw_input('> ')

    print "How many pages to dig through BBB?"
    total_pages = raw_input('> ')

    print "Working..."

    page_number = 1
    address_list = []

    url = 'https://www.bbb.org/search/?type=category&input=' + keyword + '&filter=business&page=' + str(page_number)

    req = urllib2.Request(url)
    req.add_header('User-Agent', 'Mozilla/5.0')
    resp = urllib2.urlopen(req)
    respdata = resp.read()

    address_pattern = r'<address>(.*?)<\/address>'

    while page_number <= total_pages:

        business_address = re.findall(address_pattern, str(respdata))

        for each in business_address:
            address_list.append(each)

        page_number += 1

    for each in address_list:
        print each

    print "\n Save to text file? Hit ENTER if so.\n"
    raw_input('>')

    file = open('export.txt', 'w')

    for each in address_list:
        file.write('%r \n' % each)

    file.close()

    print 'File saved!'

Edited, but I still don't get results:

    import urllib2
    import re

    print "Enter an industry keyword."
    print "Example: florists, construction, tiles"

    keyword = raw_input('> ')

    print "How many pages to dig through BBB?"
    total_pages = int(raw_input('> '))

    print "Working..."

    page_number = 1
    address_list = []

    for page_number in range(1, total_pages):

        url = 'https://www.bbb.org/search/?type=category&input=' + keyword + '&filter=business&page=' + str(page_number)

        req = urllib2.Request(url)
        req.add_header('User-Agent', 'Mozilla/5.0')
        resp = urllib2.urlopen(req)
        respdata = resp.read()

        address_pattern = r'<address>(.*?)<\/address>'

        business_address = re.findall(address_pattern, respdata)

        address_list.extend(business_address)

    for each in address_list:
        print each

    print "\n Save to text file? Hit ENTER if so.\n"
    raw_input('>')

    file = open('export.txt', 'w')

    for each in address_list:
        file.write('%r \n' % each)

    file.close()

    print 'File saved!'

Convert total_pages using int, and use range instead of your while loop:

    total_pages = int(raw_input('> '))
    ...............

    for page_number in range(2, total_pages+1):
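Some background on why the conversion matters: raw_input returns a string, and in CPython 2 an int always compares as smaller than a str, so `page_number <= total_pages` never becomes false and the loop runs forever with no error (Python 3 raises a TypeError for the same comparison). The conversion step could be sketched roughly as below. The helper names `parse_page_count` and `page_numbers` are made up for illustration and are not part of the original code, and the sketch is written so that it also runs on Python 3.

```python
def parse_page_count(text):
    """Convert the user's answer to a positive int, or raise ValueError."""
    count = int(text.strip())  # int() raises ValueError on non-numeric input
    if count < 1:
        raise ValueError("page count must be at least 1")
    return count


def page_numbers(total_pages):
    """Return the 1-based page numbers to visit, last page included."""
    # range() excludes its stop value, hence the + 1.
    return list(range(1, total_pages + 1))


if __name__ == "__main__":
    print(page_numbers(parse_page_count(" 3 ")))
```

Validating the input up front means a typo such as "three" fails immediately with a clear error instead of partway through the crawl.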

That will fix your issue, but the loop is redundant: you use the same respdata and address_pattern on every pass, so you keep adding the same thing repeatedly. If you want to crawl multiple pages, you need to move the urllib code inside the loop so that it fetches each page_number:

    # range() excludes its stop value, so add 1 to include the last page
    for page_number in range(1, total_pages + 1):
        url = 'https://www.bbb.org/search/?type=category&input=' + keyword + '&filter=business&page=' + str(page_number)
        req = urllib2.Request(url)
        req.add_header('User-Agent', 'Mozilla/5.0')
        resp = urllib2.urlopen(req)
        respdata = resp.read()

        business_address = re.findall(address_pattern, respdata)
        # use extend to add the data from findall
        address_list.extend(business_address)
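One thing the loop above does not handle is a keyword containing spaces or other special characters, which should be escaped before going into the query string. A possible sketch is shown here, assuming Python 3 (in Python 2 the same function is `urllib.urlencode`). `build_search_url` and `crawl` are hypothetical helpers, and `fetch` is whatever function you supply to download a URL, so the loop can be tried without touching the network.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.bbb.org/search/"


def build_search_url(keyword, page_number):
    """Build the search URL for one results page, escaping the keyword."""
    query = urlencode([
        ("type", "category"),
        ("input", keyword),
        ("filter", "business"),
        ("page", page_number),
    ])
    return BASE_URL + "?" + query


def crawl(keyword, total_pages, fetch):
    """Call fetch(url) once per page and return the results in page order.

    fetch is supplied by the caller (hypothetical here); it takes a URL
    and returns that page's content.
    """
    return [fetch(build_search_url(keyword, page))
            for page in range(1, total_pages + 1)]


if __name__ == "__main__":
    print(build_search_url("tile shop", 2))
```

Passing the downloader in as an argument keeps the URL logic separate from the network code, so either urllib2 or requests can be plugged in.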

respdata is already a string, so you don't need to call str on it. Using requests can simplify your code further:

    import requests

    # range() excludes its stop value, so add 1 to include the last page
    for page_number in range(1, total_pages + 1):
        url = 'https://www.bbb.org/search/?type=category&input=' + keyword + '&filter=business&page=' + str(page_number)
        # .content is already a str in Python 2, so no str() call is needed
        respdata = requests.get(url).content
        business_address = re.findall(address_pattern, respdata)
        address_list.extend(business_address)
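Whichever library downloads the page, the extraction step is the same, and it can be checked on its own against a small piece of HTML. The sketch below is one way to do that. `extract_addresses` is a hypothetical helper and the sample markup is invented for the example, so the real BBB pages may well be structured differently. It also adds re.DOTALL, which the original pattern does not use, in case an address spans several lines.

```python
import re

# re.DOTALL lets "." match newlines, so an <address> block that spans
# several lines is still captured; the non-greedy "*?" stops at the
# first closing tag.
ADDRESS_PATTERN = re.compile(r"<address>(.*?)</address>", re.DOTALL)


def extract_addresses(html):
    """Return the text of every <address> element, whitespace-trimmed."""
    return [match.strip() for match in ADDRESS_PATTERN.findall(html)]


if __name__ == "__main__":
    # Invented sample markup, not taken from the real site.
    sample = (
        "<div><address>1 Main St\nSpringfield</address></div>"
        "<div><address> 2 Oak Ave </address></div>"
    )
    print(extract_addresses(sample))
```

If this returns an empty list for a page you downloaded, the pattern does not match the markup the site actually serves, which is worth ruling out before looking at the loop.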
