python - Beautiful soup webscrape into mysql -

- September 15, 2011

The code so far downloads the page and prints the results onto the screen, but how do I get this printed material into an SQL database? If I wanted the data in CSV files, it seems that Python (on a good day) creates the file automatically. Obviously, for transferring into MySQL I assume I would have to create a database beforehand in order to receive the data. My question is how to get the data from the scrape into the database, omitting the CSV step altogether. In anticipation I have downloaded the pymysql library. Any suggestions appreciated.

    from urllib.request import urlopen
    from bs4 import BeautifulSoup

    html = urlopen("http://www.officialcharts.com/charts/singles-chart/19800203/7501/")
    bsobj = BeautifulSoup(html, "html.parser")

    # print every artist name on the page
    namelist = bsobj.find_all("div", {"class": "artist"})
    for name in namelist:
        print(name.get_text())

    # fetch the same page again and print every song title
    html = urlopen("http://www.officialcharts.com/charts/singles-chart/19800203/7501/")
    bsobj = BeautifulSoup(html, "html.parser")
    namelist = bsobj.find_all("div", {"class": "title"})
    for name in namelist:
        print(name.get_text())

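So there are a couple of things to address here.

The docs on PyMySQL are pretty good at getting you up and running.

Before you can put these things into a database though, you need to grab them in a way that keeps the artist and song name associated with each other. Right now you are getting a separate list of artists and a separate list of songs, with no reliable way to associate them. You will want to iterate over the title-artist class to do this.

I would do this like so: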

    from urllib.request import urlopen
    from bs4 import BeautifulSoup

    # webpage connection
    html = urlopen("http://www.officialcharts.com/charts/singles-chart/19800203/7501/")

    # grab the title-artist classes
    bsobj = BeautifulSoup(html, "html.parser")
    recordlist = bsobj.find_all("div", {"class": "title-artist"})

    # iterate over recordlist to grab the title and artist of each record
    for record in recordlist:
        title = record.find("div", {"class": "title"}).get_text().strip()
        artist = record.find("div", {"class": "artist"}).get_text().strip()
        print(artist + ': ' + title)

This will print the title and artist for each iteration of the recordlist loop.
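To see why reading both fields from the same record matters, here is a small sketch that needs no network access. The song and artist names are made-up placeholders, and the missing artist in the second entry is an assumption chosen to show the failure; it is not taken from the chart page.

```python
# Hypothetical scrape results: the second entry has no artist element,
# so the artist list ends up one item shorter than the title list.
titles = ["Song One", "Song Two", "Song Three"]
artists = ["Artist One", "Artist Three"]

# Pairing by position silently attaches "Artist Three" to "Song Two"
# and drops "Song Three" altogether.
paired_by_position = list(zip(artists, titles))

# Reading both fields from the same record keeps them together and
# makes the gap explicit.
records = [
    {"title": "Song One", "artist": "Artist One"},
    {"title": "Song Two", "artist": None},
    {"title": "Song Three", "artist": "Artist Three"},
]
paired_by_record = [(r["artist"], r["title"]) for r in records]

print(paired_by_position)
print(paired_by_record)
```

Two separate lists only line up as long as every entry on the page has both elements; iterating over the enclosing record does not rely on that.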

To insert these values into a MySQL DB, I created a table called artist_song with the following:

    CREATE TABLE `artist_song` (
      `id` int(11) NOT NULL AUTO_INCREMENT,
      `artist` varchar(255) COLLATE utf8_bin NOT NULL,
      `song` varchar(255) COLLATE utf8_bin NOT NULL,
      PRIMARY KEY (`id`)
    ) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_bin
    AUTO_INCREMENT=1;

This isn't the cleanest way to go about this, but the idea is sound. You want to open a connection to your MySQL DB (I have called my DB top_40), and insert an artist/title pair for each iteration of the recordlist loop:

    from urllib.request import urlopen
    from bs4 import BeautifulSoup
    import pymysql.cursors

    # webpage connection
    html = urlopen("http://www.officialcharts.com/charts/singles-chart/19800203/7501/")

    # grab the title-artist classes and store them in recordlist
    bsobj = BeautifulSoup(html, "html.parser")
    recordlist = bsobj.find_all("div", {"class": "title-artist"})

    # Create a pymysql cursor and iterate over each title-artist record.
    # This will create an insert statement for each artist/title pair, then
    # commit the transaction after reaching the end of the list. pymysql does
    # not have autocommit enabled by default. After committing, it will close
    # the database connection.

    # create the database connection
    connection = pymysql.connect(host='localhost',
                                 user='root',
                                 password='password',
                                 db='top_40',
                                 charset='utf8mb4',
                                 cursorclass=pymysql.cursors.DictCursor)

    try:
        with connection.cursor() as cursor:
            for record in recordlist:
                title = record.find("div", {"class": "title"}).get_text().strip()
                artist = record.find("div", {"class": "artist"}).get_text().strip()
                sql = "INSERT INTO `artist_song` (`artist`, `song`) VALUES (%s, %s)"
                cursor.execute(sql, (artist, title))
        connection.commit()
    finally:
        connection.close()
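If you want to try the insert-then-commit flow before a MySQL server is available, the following is a minimal sketch that swaps in Python's built-in sqlite3 module as a stand-in. This is not pymysql: sqlite3 uses `?` placeholders where pymysql uses `%s`, the table definition is simplified to SQLite types, and the artist/title pairs are placeholders rather than scraped data. It also uses `executemany`, which sends all pairs in one call instead of one `execute` per record.

```python
import sqlite3

# Placeholder artist/title pairs standing in for the scraped records.
pairs = [("Artist One", "Song One"), ("Artist Two", "Song Two")]

# In-memory SQLite database used only as a stand-in for the MySQL server.
connection = sqlite3.connect(":memory:")
try:
    cursor = connection.cursor()
    cursor.execute(
        "CREATE TABLE artist_song ("
        "id INTEGER PRIMARY KEY AUTOINCREMENT, "
        "artist TEXT NOT NULL, "
        "song TEXT NOT NULL)"
    )
    # sqlite3 uses ? placeholders; pymysql uses %s for the same purpose.
    cursor.executemany(
        "INSERT INTO artist_song (artist, song) VALUES (?, ?)", pairs
    )
    connection.commit()

    # Read the rows back to confirm what was stored.
    cursor.execute("SELECT artist, song FROM artist_song ORDER BY id")
    stored = cursor.fetchall()
finally:
    connection.close()

print(stored)
```

In both libraries the values are passed separately from the SQL string, so quotes in an artist or song name do not break the statement.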

Edit: Per the comment, I think it is clearer to iterate over the table rows instead:

    from urllib.request import urlopen
    from bs4 import BeautifulSoup

    # webpage connection
    html = urlopen("http://www.officialcharts.com/charts/singles-chart/19800203/7501/")

    bsobj = BeautifulSoup(html, "html.parser")

    rows = bsobj.find_all('tr')
    for row in rows:
        # only rows that carry a chart position are chart entries
        if row.find('span', {'class': 'position'}):
            position = row.find('span', {'class': 'position'}).get_text().strip()
            artist = row.find('div', {'class': 'artist'}).get_text().strip()
            track = row.find('div', {'class': 'title'}).get_text().strip()
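The row loop above collects `position`, `artist` and `track` but does not store them. One possible way to do that is sketched below. It assumes a hypothetical table named `chart_entry` with `position`, `artist` and `song` columns (the artist_song table has no position column), and the helper name `store_chart_rows` is likewise made up for illustration. The function takes an already open pymysql connection, so it does not import pymysql itself.

```python
def store_chart_rows(connection, chart_rows):
    """Insert (position, artist, track) tuples and commit once at the end.

    `connection` is expected to be an open pymysql connection.
    `chart_entry` is a hypothetical table, not one defined in this post.
    """
    sql = (
        "INSERT INTO `chart_entry` (`position`, `artist`, `song`) "
        "VALUES (%s, %s, %s)"
    )
    # The scraped position is text, so convert it to an integer first.
    params = [
        (int(position), artist, track)
        for position, artist, track in chart_rows
    ]
    with connection.cursor() as cursor:
        cursor.executemany(sql, params)
    connection.commit()
    return len(params)
```

To use it, append `(position, artist, track)` to a list inside the row loop, then call `store_chart_rows(connection, chart_rows)` once after the loop and close the connection afterwards.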
