javascript - Options for article scraping from many different websites
I need to add webpage scraping functionality to a single page application.
I need to retrieve useful content from many different blogs and services. By useful content, I mean articles, texts, links, and videos, in order to embed them on my pages.
This tool seems to offer what I need: http://www.diffbot.com/
Using it, I can input an article's URL and the service will retrieve the data I need for a single page.
However, I do not need to handle 250 thousand requests on a monthly basis, which costs $300 each month; I need a solution that can handle 5000 requests each month, with the possibility of scaling later.
I've found a lot of scraping solutions through Google, but they offer to scrape custom content periodically from a small number of websites, which is not what I need. Also, I do not have experience in this area, so please advise me on what I should use for this purpose. I am dealing with JavaScript.
In addition, is it at all possible to allow pages to be scraped by the client's browser, rather than server-side?
I develop the SPA with ReactJS and the Flux architecture. The server is Node.js + Express, and the database is Backendless.
It sounds like a custom solution, perhaps built on Node.js, would be your best bet (taking into consideration the JS requirement). There are several Node modules you can use to accomplish this. I recommend the following:
request - used to grab the HTML from the target webpage
cheerio - used to filter the HTML gathered by request
node-horseman - used to execute JavaScript on the target web page (for more advanced scraping)
artoo - a client-side scraping library (I've never used it, but it may be what you are looking for)
As for SPA development, I recommend SailsJS.
Here is an example Node app using the above modules to scrape https://rotogrinders.com/pages/mlb-pitcher-hub-sp-salary-charts-260515
request & cheerio:
    var cheerio = require('cheerio'),
        request = require('request');

    // Define the target URL and HTTP method here
    var options = {
        url: 'https://rotogrinders.com/pages/mlb-pitcher-hub-sp-salary-charts-260515',
        method: 'GET'
    };

    // Use request to grab the HTML defined in options and return the contents in "body"
    request(options, function (err, res, body) {
        if (!err && res.statusCode == 200) {
            // Load "body" into cheerio
            var $ = cheerio.load(body);

            // Grab each occurrence of the matched HTML using cheerio
            // (use Chrome Developer Tools to determine the CSS selector)
            $('tbody').children().each(function (i, element) {
                var $element = $(element);
                var name = $element.children().eq(0).text().trim();
                var salary = $element.children().eq(3).text().trim();

                // Put the filtered data in an object
                var post = {
                    name: name,
                    salary: salary
                };

                // Print the object to the console
                console.log(post);
            });
        }
    });
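Since the question asks for links and videos to embed, here is a rough sketch of how scraped href values might be post-processed once cheerio has collected them. It is only an illustration under assumptions: classifyLinks, isVideoHost and the VIDEO_HOSTS list are hypothetical names that do not come from request or cheerio, and the host list is deliberately incomplete. It relies only on the URL class built into Node.js.

```javascript
// Hypothetical helper, not part of request or cheerio.
// VIDEO_HOSTS is an assumed, incomplete list; extend it for the sites you scrape.
const VIDEO_HOSTS = ['youtube.com', 'youtu.be', 'vimeo.com'];

// Returns true when the hostname is one of VIDEO_HOSTS or a subdomain of one.
function isVideoHost(hostname) {
  const host = hostname.replace(/^www\./, '');
  return VIDEO_HOSTS.some(function (h) {
    return host === h || host.endsWith('.' + h);
  });
}

// Resolves each href against the page URL and splits the results
// into ordinary links and links that point at a known video host.
function classifyLinks(pageUrl, hrefs) {
  const result = { links: [], videos: [] };
  hrefs.forEach(function (href) {
    let url;
    try {
      url = new URL(href, pageUrl); // handles relative and absolute hrefs
    } catch (e) {
      return; // skip values that cannot be parsed as URLs
    }
    if (url.protocol !== 'http:' && url.protocol !== 'https:') {
      return; // skip mailto:, javascript: and similar schemes
    }
    if (isVideoHost(url.hostname)) {
      result.videos.push(url.href);
    } else {
      result.links.push(url.href);
    }
  });
  return result;
}
```

Inside the request callback above, the hrefs could be gathered with something like $('a').map(function (i, el) { return $(el).attr('href'); }).get() and then passed to classifyLinks together with options.url.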