multithreading - Using thread causes "python.exe has stopped working"


I recently tried to add threads to my scraper so that it can have higher efficiency while scraping.

But somehow it randomly causes python.exe "has stopped working" with no further information given, hence I have no idea how to debug it.

Here is the relevant code:

  1. Where the threads are initiated:

    def run(self):
        """
        Create the threads and run the scraper.
        :return:
        """
        self.__load_resource()
        self.__prepare_threads_args()
        # Each thread is allocated a different set of links to scrape from,
        # so there should be no collisions.
        for item in self.threads_args:
            try:
                t = threading.Thread(target=self.urllib_method, args=(item,))
                # Use the following expression to use the Selenium scraper instead:
                # t = threading.Thread(target=self.__scrape_site, args=(item,))
                self.threads.append(t)
                t.start()
            except Exception as ex:
                print ex

  2. What the scraper looks like:

    def urllib_method(self, thread_args):
        """
        :param thread_args: arguments containing the files to scrape and the proxy to use
        :return:
        """
        site_scraper = SiteScraper()
        for file in thread_args["files"]:
            current_folder_path = self.__prepare_output_folder(file["name"])

            articles_without_comments_file = os.path.join(current_folder_path, "articles_without_comments")
            articles_without_comments_links = get_links_from_file(articles_without_comments_file) if isfile(articles_without_comments_file) else []

            articles_scraped_file = os.path.join(current_folder_path, "articles_scraped")
            articles_scraped_links = get_links_from_file(articles_scraped_file) if isfile(articles_scraped_file) else []

            links = get_links_from_file(file["path"])
            for link in links:
                article_id = extract_article_id(link)

                # Skip articles whose output file already exists.
                if isfile(join(current_folder_path, article_id)):
                    print "skip: ", link
                    if link not in articles_scraped_links:
                        append_text_to_file(articles_scraped_file, link)
                    continue
                if link in articles_without_comments_links:
                    continue

                comments = site_scraper.call_comments_endpoint(article_id, thread_args["proxy"])

                if comments != "pro article" and comments != "crash" and comments != "no comments" and comments is not None:
                    print article_id, comments[0:14]
                    write_text_to_file(os.path.join(current_folder_path, article_id), comments)
                    sleep(1)
                    append_text_to_file(articles_scraped_file, link)
                elif comments == "no comments":
                    print "article without comments: ", article_id
                    if link not in articles_without_comments_links:
                        append_text_to_file(articles_without_comments_file, link)
                    sleep(1)

I have tried to run the script on both Windows 10 and 8.1, and the issue exists on both of them.

Also, the more data is scraped, the more frequently it happens. And the more threads are used, the more frequently it happens.

Threads in Python before 3.2 are troublesome to use because of the diabolical global interpreter lock (GIL), which allows only one thread to execute Python bytecode at a time.

The preferred way to utilize multiple cores and processes in Python is via the multiprocessing package.

https://docs.python.org/2/library/multiprocessing.html
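
As a rough sketch of what that could look like for a scraper like the one in the question, the links can be split into disjoint batches and handed to a `multiprocessing.Pool`. The names `split_links`, `scrape_batch` and `run_in_processes`, and the example.com URLs, are hypothetical and not from the original code; `scrape_batch` is only a stand-in for the real per-link scraping loop, so the sketch runs without network access.

```python
import multiprocessing


def split_links(links, n_workers):
    """Deal the links round-robin into n_workers disjoint batches."""
    return [links[i::n_workers] for i in range(n_workers)]


def scrape_batch(batch):
    """Hypothetical stand-in for the per-worker scraping loop.

    A real version would fetch the comments for each link; this one only
    tags each link as done so that the sketch runs offline.
    """
    return [(link, "done") for link in batch]


def run_in_processes(links, n_workers=4):
    """Process disjoint batches in separate processes and merge the results."""
    batches = split_links(links, n_workers)
    pool = multiprocessing.Pool(processes=n_workers)
    try:
        # map() blocks until every batch has been processed.
        results = pool.map(scrape_batch, batches)
    finally:
        pool.close()
        pool.join()
    # Flatten the per-process lists into a single list.
    return [item for batch_result in results for item in batch_result]


if __name__ == "__main__":
    # This guard is required on Windows, where child processes
    # re-import the main module when they start.
    links = ["http://example.com/article/%d" % i for i in range(10)]
    for link, status in run_in_processes(links, n_workers=3):
        print("%s %s" % (link, status))
```

The worker function has to live at module level so it can be pickled and sent to the child processes, which is why this sketch uses plain functions rather than bound methods like `self.urllib_method`.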

