Node Crawler
1. Node Crawler
It's been a long time since my last blog post. This time I'm starting an open source project called node-crawler. Let's get straight to the point: the project consists of three parts:
1.1 Admin Dashboard
It is a distributed crawling service. Through the admin dashboard, an admin can check crawler status, manage the crawled data, control the crawler clients, manage the client configs, and deploy clients to different servers, plus functions such as data analysis and statistics. (The dashboard is mainly developed with AngularJS.)
1.2 Backend Service
A Node.js backend service responsible for collecting the data uploaded by the clients. It mainly includes functions such as filtering the data, analyzing the crawling results, and storing them in the database.
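As a rough sketch, the collector could be a small Express app that accepts crawl results over HTTP. The route name, payload fields, and the persistence step below are only my illustration, not a fixed API:

```js
// A minimal sketch of the collector service, assuming Express.
// Endpoint name and payload fields are illustrative.
const express = require('express');

const app = express();
app.use(express.json());

// Clients POST their crawl results here.
app.post('/api/results', (req, res) => {
  const { url, html, crawledAt } = req.body || {};

  // Basic filtering: drop obviously invalid payloads.
  if (!url || !html) {
    return res.status(400).json({ error: 'url and html are required' });
  }

  // Analysis and persistence would happen here, e.g. a saveResult()
  // helper that writes into the database (placeholder, not implemented).
  res.json({ ok: true });
});

app.listen(3000, () => console.log('collector listening on 3000'));
```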
1.3 Crawler Client
The client is responsible for crawling pages, executing scripts, and controlling the number of spiders. The idea is that the backend service sends down a config, the client reads it, and then does the crawling job the server assigned.
The client also has to be smart enough to avoid being blocked by the target server, so it needs strategies like rotating IPs, changing the user agent, limiting the spider speed, and so on, as sketched below.
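For example, rotating the User-Agent header and throttling requests could look roughly like this (the agent strings and delay values are placeholders):

```js
// A minimal sketch of two anti-blocking strategies: rotating the
// User-Agent and limiting the spider speed. Values are illustrative.
const USER_AGENTS = [
  'Mozilla/5.0 (Windows NT 10.0; Win64; x64)',
  'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)',
  'Mozilla/5.0 (X11; Linux x86_64)',
];

function randomUserAgent() {
  return USER_AGENTS[Math.floor(Math.random() * USER_AGENTS.length)];
}

// Wait between requests so the spider does not hammer the target server.
function delay(ms) {
  return new Promise((resolve) => setTimeout(resolve, ms));
}

async function politeFetch(url, minDelayMs) {
  await delay(minDelayMs);
  // fetch() ships with recent Node.js versions; swap in your preferred
  // HTTP client otherwise.
  return fetch(url, { headers: { 'User-Agent': randomUserAgent() } });
}
```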
The server sends messages to check on the client and to start it, delivering a config for the crawling job like this:
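(The exact format is still open; the snippet below is only an illustration of the idea, and every field name is an assumption.)

```js
// Illustrative config the server could push to a crawler client.
module.exports = {
  taskId: 'demo-news-site',                      // identifies the crawling job
  startUrls: ['https://example.com/list?page=1'],
  maxSpiders: 5,                                 // how many spiders to run at once
  intervalMs: 2000,                              // delay between requests
  userAgentPool: ['Mozilla/5.0 (X11; Linux x86_64)'],
  uploadEndpoint: 'http://collector.example.com/api/results',
};
```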
The client then starts the crawling work according to the config:
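Again just a sketch: the client reads the config, crawls each start URL, and uploads the result to the backend. The use of fetch() (Node.js 18+) and the helper names are assumptions on my part, not the project's final code.

```js
// A minimal sketch of the crawler client's main loop.
const config = require('./config'); // the object shown above

async function crawl(url) {
  const res = await fetch(url, {
    headers: { 'User-Agent': config.userAgentPool[0] },
  });
  return res.text();
}

async function upload(url, html) {
  await fetch(config.uploadEndpoint, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ url, html, crawledAt: Date.now() }),
  });
}

async function run() {
  for (const url of config.startUrls) {
    const html = await crawl(url);
    await upload(url, html);
    // Respect the configured interval between requests.
    await new Promise((resolve) => setTimeout(resolve, config.intervalMs));
  }
}

run().catch(console.error);
```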
This post is just an introduction to the crawling system. Future posts will explain how to implement each part. Welcome!