multithreading - Using threads causes "python.exe has stopped working"
I recently tried to add threads to my scraper so that it scrapes with higher efficiency.
But somehow it randomly causes python.exe to report "has stopped working". No further information is given, so I have no idea how to debug it.
Here is the relevant code.
Where the threads are initiated:
```python
def run(self):
    """
    Create threads and run the scraper.
    :return:
    """
    self.__load_resource()
    self.__prepare_threads_args()
    # each thread is allocated a different set of links to scrape from,
    # so there should be no collisions
    for item in self.threads_args:
        try:
            t = threading.Thread(target=self.urllib_method, args=(item,))
            # use the following line to use the selenium scraper instead
            # t = threading.Thread(target=self.__scrape_site, args=(item,))
            self.threads.append(t)
            t.start()
        except Exception as ex:
            print ex
```
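Note that the code above starts one thread per argument set and never joins them, so the number of live threads is unbounded. A bounded pool is one way to keep that in check; below is a minimal Python 3 sketch using `concurrent.futures`, with a hypothetical `scrape_batch` function standing in for `urllib_method` (both names here are illustrative, not from the original code):

```python
from concurrent.futures import ThreadPoolExecutor

def scrape_batch(thread_args):
    # hypothetical stand-in for the real per-batch scraping work
    return "done:" + thread_args["name"]

def run_bounded(all_args, max_workers=4):
    # cap the number of live threads and wait for all of them to finish;
    # the executor joins its workers when the with-block exits
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(scrape_batch, all_args))
```

`pool.map` preserves the input order of `all_args`, so results line up with the batches that produced them.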
What the scraper looks like:
```python
def urllib_method(self, thread_args):
    """
    :param thread_args: arguments containing the files to scrape and the proxy to use
    :return:
    """
    site_scraper = SiteScraper()
    for file in thread_args["files"]:
        current_folder_path = self.__prepare_output_folder(file["name"])

        articles_without_comments_file = os.path.join(current_folder_path, "articles_without_comments")
        articles_without_comments_links = get_links_from_file(articles_without_comments_file) \
            if isfile(articles_without_comments_file) else []

        articles_scraped_file = os.path.join(current_folder_path, "articles_scraped")
        articles_scraped_links = get_links_from_file(articles_scraped_file) \
            if isfile(articles_scraped_file) else []

        links = get_links_from_file(file["path"])
        for link in links:
            article_id = extract_article_id(link)

            if isfile(join(current_folder_path, article_id)):
                print "skip: ", link
                if link not in articles_scraped_links:
                    append_text_to_file(articles_scraped_file, link)
                continue

            if link in articles_without_comments_links:
                continue

            comments = site_scraper.call_comments_endpoint(article_id, thread_args["proxy"])

            if comments != "pro article" and comments != "crash" and comments != "no comments" and comments is not None:
                print article_id, comments[0:14]
                write_text_to_file(os.path.join(current_folder_path, article_id), comments)
                sleep(1)
                append_text_to_file(articles_scraped_file, link)
            elif comments == "no comments":
                print "article without comments: ", article_id
                if link not in articles_without_comments_links:
                    append_text_to_file(articles_without_comments_file, link)
            sleep(1)
```
I have tried running the script on both Windows 10 and 8.1, and the issue exists on both of them.
Also, the more data is scraped, the more frequently it happens, and the more threads are used, the more frequently it happens.
Threads in Python pre-3.2 were unsafe to use, due to the diabolical Global Interpreter Lock.
The preferred way to utilize multiple cores and processes in Python is via the multiprocessing package.