javascript - Options for article scraping from many different websites


I need to add webpage scraping functionality to a single page application.

I need to retrieve useful content from many different blogs and services. By useful content, I mean articles, texts, and links to videos, so that I can embed them on my pages.

This tool seems to offer what I need: http://www.diffbot.com/

Using it, I can input an article's URL and the service will retrieve the data I need for a single page.
However, I do not need to handle 250 thousand requests on a monthly basis, which costs $300 each month; I need a solution that can handle 5000 requests each month, with the possibility of scaling later.

I've found a lot of scraping solutions through Google, but they offer ways to scrape custom content periodically from a small number of websites, which is not what I need. Also, I do not have experience in this area, so please advise me on what I should use for this purpose. I am working with JavaScript.

In addition, is it at all possible to have pages scraped by the client's browser, rather than server-side?

I am developing the SPA with ReactJS and the Flux architecture. The server is Node.js + Express, and the database is Backendless.

It sounds like a custom solution, perhaps built on Node.js, would be your best bet (taking the JS requirement into consideration). There are several Node modules you can use to accomplish this. I would recommend the following:

request - used to grab the HTML from the target webpage

cheerio - used to filter the HTML gathered by request

node-horseman - used to execute JavaScript on the target web page (for more advanced scraping)

artoo - a client-side scraping library (I've never used it, but it may be what you are looking for)

As for SPA development, I would recommend SailsJS.

Here is an example Node app that uses the modules above to scrape https://rotogrinders.com/pages/mlb-pitcher-hub-sp-salary-charts-260515

request & cheerio:

    var cheerio = require('cheerio'),
        request = require('request');

    // Define the target URL and HTTP method here
    var options = {
      url: 'https://rotogrinders.com/pages/mlb-pitcher-hub-sp-salary-charts-260515',
      method: 'GET'
    };

    // Use request to grab the HTML for the URL defined in options;
    // the page contents are returned in "body"
    request(options, function (err, res, body) {
      if (!err && res.statusCode == 200) {

        // Load "body" into cheerio
        var $ = cheerio.load(body);

        // Grab each occurrence of the matched HTML using cheerio
        // (use Chrome Developer Tools to determine the CSS selector)
        $('tbody').children().each(function (i, element) {
          var $element = $(element);
          var name = $element.children().eq(0).text().trim();
          var salary = $element.children().eq(3).text().trim();

          // Put the filtered data in an object
          var post = {
            name: name,
            salary: salary
          };

          // Print the object to the console
          console.log(post);
        });
      }
    });
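Since the question asks for links and videos to embed, the href values pulled out of a page usually need a little cleanup first: many are relative, some are duplicates, and some are not web links at all. The following is a minimal sketch of that step using only Node's built-in URL class. The function names (`resolveLink`, `isVideoLink`, `classifyLinks`) and the `VIDEO_HOSTS` list are assumptions made for illustration; they are not part of request or cheerio.

```javascript
// Hypothetical helpers (not part of request or cheerio): normalise scraped
// href values so they can be embedded on another site.
const { URL } = require('url');

// Assumed list of video hosts; extend it for the sites you actually scrape.
const VIDEO_HOSTS = ['youtube.com', 'youtu.be', 'vimeo.com'];

// Resolve a possibly relative link against the URL of the page it came from.
// Returns null for non-string values, unparsable values, and non-http(s) links.
function resolveLink(href, pageUrl) {
  if (typeof href !== 'string') return null;
  try {
    const resolved = new URL(href, pageUrl);
    if (resolved.protocol !== 'http:' && resolved.protocol !== 'https:') {
      return null;
    }
    return resolved.href;
  } catch (err) {
    return null;
  }
}

// True when the absolute URL points at one of the assumed video hosts.
function isVideoLink(absoluteUrl) {
  const host = new URL(absoluteUrl).hostname.replace(/^www\./, '');
  return VIDEO_HOSTS.some(function (h) {
    return host === h || host.endsWith('.' + h);
  });
}

// Split raw href values into de-duplicated plain links and video links.
function classifyLinks(hrefs, pageUrl) {
  const seen = new Set();
  const result = { links: [], videos: [] };
  hrefs.forEach(function (href) {
    const url = resolveLink(href, pageUrl);
    if (url === null || seen.has(url)) return;
    seen.add(url);
    (isVideoLink(url) ? result.videos : result.links).push(url);
  });
  return result;
}
```

Inside the request callback above, the raw values could be collected with something like `$('a').map(function (i, el) { return $(el).attr('href'); }).get()` and passed to `classifyLinks` together with `options.url`.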
