multithreading - Using threads causes "python.exe has stopped working"
I recently tried to add threads to my scraper so that it scrapes with higher efficiency.
But somehow it randomly causes python.exe to report "has stopped working". No further information is given, so I have no idea how to debug it.
Here is the relevant code.
Where the threads are initiated:
```python
def run(self):
    """
    Create threads and run the scraper.
    :return:
    """
    self.__load_resource()
    self.__prepare_threads_args()
    # each thread is allocated a different set of links to scrape from,
    # so there should be no collisions
    for item in self.threads_args:
        try:
            t = threading.Thread(target=self.urllib_method, args=(item,))
            # use the following line to use the selenium scraper instead
            # t = threading.Thread(target=self.__scrape_site, args=(item,))
            self.threads.append(t)
            t.start()
        except Exception as ex:
            print ex
```
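Note that the code above starts one thread per argument set and never joins them, so the number of live threads is unbounded. A bounded pool is one way to keep that in check; below is a minimal Python 3 sketch using `concurrent.futures`, with a hypothetical `scrape_batch` function standing in for `urllib_method` (both names here are illustrative, not from the original code):

```python
from concurrent.futures import ThreadPoolExecutor

def scrape_batch(thread_args):
    # hypothetical stand-in for the real per-batch scraping work
    return "done:" + thread_args["name"]

def run_bounded(all_args, max_workers=4):
    # cap the number of live threads and wait for all of them to finish;
    # the executor joins its workers when the with-block exits
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(scrape_batch, all_args))
```

`pool.map` preserves the input order of `all_args`, so results line up with the batches that produced them.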
What the scraper looks like:
```python
def urllib_method(self, thread_args):
    """
    :param thread_args: arguments containing the files to scrape and the proxy to use
    :return:
    """
    site_scraper = SiteScraper()
    for file in thread_args["files"]:
        current_folder_path = self.__prepare_output_folder(file["name"])

        articles_without_comments_file = os.path.join(current_folder_path, "articles_without_comments")
        articles_without_comments_links = get_links_from_file(articles_without_comments_file) \
            if isfile(articles_without_comments_file) else []

        articles_scraped_file = os.path.join(current_folder_path, "articles_scraped")
        articles_scraped_links = get_links_from_file(articles_scraped_file) \
            if isfile(articles_scraped_file) else []

        links = get_links_from_file(file["path"])
        for link in links:
            article_id = extract_article_id(link)

            if isfile(join(current_folder_path, article_id)):
                print "skip: ", link
                if link not in articles_scraped_links:
                    append_text_to_file(articles_scraped_file, link)
                continue

            if link in articles_without_comments_links:
                continue

            comments = site_scraper.call_comments_endpoint(article_id, thread_args["proxy"])

            if comments != "pro article" and comments != "crash" and comments != "no comments" and comments is not None:
                print article_id, comments[0:14]
                write_text_to_file(os.path.join(current_folder_path, article_id), comments)
                sleep(1)
                append_text_to_file(articles_scraped_file, link)
            elif comments == "no comments":
                print "article without comments: ", article_id
                if link not in articles_without_comments_links:
                    append_text_to_file(articles_without_comments_file, link)
            sleep(1)
```
I have tried running the script on both Windows 10 and 8.1, and the issue exists on both of them.
Also, the more data is scraped, the more frequently it happens, and the more threads are used, the more frequently it happens.
Threads in Python pre-3.2 were unsafe to use, due to the diabolical Global Interpreter Lock.
The preferred way to utilize multiple cores and processes in Python is via the multiprocessing package.