Node Crawler

1. Node crawler

It’s been a long time since my last blog post. This time I’m starting an open source project called node-crawler. Let’s get straight to the point: the project includes three parts:

1.1 Admin Dashboard

This is the dashboard of a distributed crawling service. Through it, the admin can check crawler status, manage the crawled data, control the crawler clients, manage the client configs, and deploy clients to different servers. It also covers functions such as data analysis and statistics. (This part is mainly developed with AngularJS.)
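As a rough idea of how the dashboard might talk to the backend, here is a minimal AngularJS sketch that fetches crawler client status. The /api/clients endpoint and the response fields are assumptions for illustration, not the final API.

// adminApp.js - minimal sketch of the dashboard fetching client status
// NOTE: the /api/clients endpoint and the response shape are assumptions
angular.module('adminApp', [])
  .controller('ClientStatusCtrl', ['$scope', '$http', function ($scope, $http) {
    $scope.clients = [];

    // load the status of every crawler client from the backend service
    $scope.refresh = function () {
      $http.get('/api/clients').then(function (res) {
        $scope.clients = res.data; // e.g. [{ id, host, status, pagesCrawled }]
      });
    };

    $scope.refresh();
  }]);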

1.2 Backend Service

A Node.js backend service responsible for collecting the data uploaded by the clients. Its main functions include filtering the data, analyzing the crawling results, and storing them into the database.
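To make that concrete, here is a minimal sketch of such a service, assuming an Express app with a /api/results upload endpoint and a MongoDB store. The route name, schema, and filter rule are placeholders, not the final design.

// backend.js - sketch of the data-collecting service (assumed design)
var express = require('express');
var bodyParser = require('body-parser');
var mongoose = require('mongoose');

mongoose.connect('mongodb://localhost/node-crawler');

// hypothetical schema for one crawled record
var Result = mongoose.model('Result', new mongoose.Schema({
  target: String,   // page the data came from
  data: String,     // extracted value uploaded by the client
  crawledAt: { type: Date, default: Date.now }
}));

var app = express();
app.use(bodyParser.json());

// clients POST their crawling results here (endpoint name is an assumption)
app.post('/api/results', function (req, res) {
  // simple filter: drop empty payloads before storing
  if (!req.body || !req.body.data) {
    return res.sendStatus(400);
  }
  Result.create({ target: req.body.target, data: req.body.data }, function (err) {
    res.sendStatus(err ? 500 : 201);
  });
});

app.listen(3000);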

1.3 Crawler Client

The client is responsible for crawling pages, executing scripts, and controlling the number of spiders. The idea is that the backend service sends some configs, and the client reads them and performs the crawling job the server assigns to it.

The client should also be smart enough to avoid being blocked by the target server. It should have strategies such as changing the IP, changing the user agent, limiting the spider speed, and so on.
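As a rough sketch of those strategies, the snippet below rotates user agents and paces requests with a fixed delay, using the request module; the agent list and the delay are only placeholders, and changing the IP could be handled similarly through the proxy option of request.

// politeness.js - sketch of user-agent rotation and speed limiting (assumed approach)
var request = require('request');

var userAgents = [
  'Mozilla/5.0 (Windows NT 6.1; WOW64)',
  'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9)',
  'Mozilla/5.0 (X11; Linux x86_64)'
];

var delayMs = 1000; // at most one request per second per spider

function fetch(url, callback) {
  // pick a random user agent for every request
  var ua = userAgents[Math.floor(Math.random() * userAgents.length)];
  request({ url: url, headers: { 'User-Agent': ua } }, function (err, res, body) {
    // wait before allowing the next request, so we do not hammer the target
    setTimeout(function () {
      callback(err, body);
    }, delayMs);
  });
}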

The server sends messages to check on the client and to start it. For example, to start the client it sends a config like this:

var clientConfig = {
  speed: 100,   // access the target website at most 100 times per second
  workers: 10,  // run 10 workers crawling the website concurrently
  // some other configs
};
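How the config actually reaches the client is not decided yet. A minimal sketch, assuming the backend pushes it over a socket.io connection and the client spawns the requested number of workers, could look like this:

// client.js - sketch of receiving the config and starting workers
// socket.io and the 'start' event name are assumptions
var io = require('socket.io-client');
var socket = io('http://backend-host:3000');

var urlQueue = []; // would be filled from a crawlerConfig sent by the server

socket.on('start', function (clientConfig) {
  // spread the overall speed budget across the workers
  var perWorkerDelay = clientConfig.workers * 1000 / clientConfig.speed;

  for (var i = 0; i < clientConfig.workers; i++) {
    startWorker(perWorkerDelay);
  }
});

function startWorker(delayMs) {
  (function next() {
    var url = urlQueue.shift();
    if (!url) { return; } // nothing left to crawl
    // the actual fetch/extract step is sketched in the next snippet
    console.log('worker crawling', url);
    setTimeout(next, delayMs); // respect the speed limit before the next page
  })();
}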

The client then starts the crawling work according to the config:

var crawlerConfig = {
  target: "http://www.taobao.com/p/xxxxx", // target website to crawl
  element: '',  // selector for the target element
  attr: '',     // attribute holding the real data to grab
  // some other configs about the target web page
};
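A minimal sketch of the crawling step itself, assuming the request and cheerio modules, could read the target, element, and attr fields like this:

// crawl.js - sketch of fetching a page and extracting data per crawlerConfig
// request and cheerio are assumed dependencies, not a final choice
var request = require('request');
var cheerio = require('cheerio');

function crawl(crawlerConfig, done) {
  request(crawlerConfig.target, function (err, res, body) {
    if (err) { return done(err); }

    var $ = cheerio.load(body);
    var results = [];

    // collect the requested attribute from every matching element
    $(crawlerConfig.element).each(function () {
      results.push($(this).attr(crawlerConfig.attr));
    });

    done(null, results); // results would then be uploaded to the backend service
  });
}

With a config like the one above, crawl() would return the value of attr for every element matching the selector, ready to be posted to the backend service.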

This post is just an introduction to the crawling system. Later posts will cover how to implement it. Welcome!