Crawler Project


A Go-language crawler project, explained in depth by a senior Google engineer.

Overview:

The Crawler-website is a web crawler tool written in Go. It crawls websites, extracts structured data, and stores the results in an Elasticsearch database. The project follows the MVC pattern and uses Docker for easy deployment and management. Crawling scales through three architectures: singleton, concurrent, and distributed. A built-in website provides a simple interface for viewing and querying the crawled data; a sketch of the core engine loop follows.
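
As a rough illustration of the singleton architecture, the engine can be pictured as a single fetch-parse loop. The sketch below is self-contained and runnable; the Request and ParseResult types and the example parser are illustrative placeholders, not the project's actual definitions:

// singleton_sketch.go - a minimal, single-threaded crawler engine loop.
package main

import (
    "fmt"
    "io/ioutil"
    "net/http"
)

type ParseResult struct {
    Requests []Request     // newly discovered URLs to crawl next
    Items    []interface{} // data items extracted from the page
}

type Request struct {
    Url    string
    Parser func(contents []byte) ParseResult
}

func fetch(url string) ([]byte, error) {
    resp, err := http.Get(url)
    if err != nil {
        return nil, err
    }
    defer resp.Body.Close()
    return ioutil.ReadAll(resp.Body)
}

func run(seeds ...Request) {
    requests := seeds
    for len(requests) > 0 {
        r := requests[0]
        requests = requests[1:]

        body, err := fetch(r.Url)
        if err != nil {
            continue // skip pages that fail to download
        }
        result := r.Parser(body)
        requests = append(requests, result.Requests...)
        for _, item := range result.Items {
            fmt.Println(item) // a real engine hands items to a saver
        }
    }
}

func main() {
    run(Request{
        Url: "https://example.com",
        Parser: func(contents []byte) ParseResult {
            // Trivial parser: report the page size as an "item".
            return ParseResult{Items: []interface{}{len(contents)}}
        },
    })
}

The concurrent and distributed engines can keep this same Request/ParseResult contract and change only how the work is scheduled.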

Features:

  • Go language: The crawler is developed using the Go programming language, known for its concurrency and performance benefits.
  • Docker: The tool can be easily deployed using Docker containers, simplifying the setup and management process.
  • Elasticsearch: The crawled data is stored in an Elasticsearch database, enabling efficient indexing and querying of the data.
  • MVC pattern: The crawler follows the Model-View-Controller (MVC) architectural pattern, separating the concerns of data storage, presentation, and user interaction.
  • Microservices: The distributed crawling capability is achieved using microservices, allowing the tool to scale and handle large volumes of data.
  • Singleton -> Concurrent -> Distributed: The crawler evolves through three architectures, from a single-threaded (singleton) engine to a concurrent engine and finally a distributed one (see the worker-pool sketch after this list).
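
To illustrate the concurrent step, the fetch-and-parse work can be spread across a pool of goroutines connected by channels. This is a generic worker-pool sketch, not the project's actual scheduler; worker() stands in for the fetch-parse step:

// concurrent_sketch.go - a minimal worker-pool version of the engine.
package main

import "fmt"

type Request struct{ Url string }
type ParseResult struct{ Items []string }

// worker stands in for the fetch-and-parse step.
func worker(r Request) ParseResult {
    return ParseResult{Items: []string{"item from " + r.Url}}
}

func main() {
    in := make(chan Request)
    out := make(chan ParseResult)

    // Start a fixed pool of workers; each pulls requests from `in`
    // and pushes results to `out`.
    for i := 0; i < 10; i++ {
        go func() {
            for r := range in {
                out <- worker(r)
            }
        }()
    }

    // Feed seed requests from a separate goroutine so main can
    // consume results without deadlocking.
    go func() {
        for i := 0; i < 100; i++ {
            in <- Request{Url: fmt.Sprintf("https://example.com/page/%d", i)}
        }
    }()

    for i := 0; i < 100; i++ {
        result := <-out
        fmt.Println(result.Items)
    }
}

Because workers communicate only through channels, the same worker code can later be moved behind an RPC boundary for the distributed setup.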

Installation:

To install the Crawler-website, follow these steps:

  • Install Go language and Docker.
  • Install the necessary Go packages using the following commands:
go get golang.org/x/text
go get -v github.com/gpmgo/gopm
gopm get -g -v golang.org/x/text
gopm get -g -v golang.org/x/net/html
go get gopkg.in/olivere/elastic.v5
  • Start an Elasticsearch container in Docker by running the command:
docker run -d -p 9200:9200 elasticsearch
  • To start the singleton crawler, run the command:
go run src/crawler/main.go
  • To serve the results website, run the command:
go run src/crawler/frontend/starter.go
  • Visit "http://localhost:8888/" in your browser and enter a query string such as "女 && Age>20" (female, age over 20).
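
Once the Elasticsearch container above is running, persisting one crawled item can look like the sketch below, which uses the gopkg.in/olivere/elastic.v5 client installed earlier. The index name, type name, and Profile struct are assumptions for illustration, not necessarily the project's actual names:

// itemsaver_sketch.go - persisting one item to Elasticsearch 5.x.
package main

import (
    "context"
    "fmt"

    "gopkg.in/olivere/elastic.v5"
)

// Profile is an assumed example of a crawled item.
type Profile struct {
    Name   string
    Gender string
    Age    int
}

func main() {
    // SetSniff(false) matters when Elasticsearch runs inside Docker:
    // the node address it advertises is not reachable from the host.
    client, err := elastic.NewClient(
        elastic.SetURL("http://localhost:9200"),
        elastic.SetSniff(false),
    )
    if err != nil {
        panic(err)
    }

    resp, err := client.Index().
        Index("crawler_items"). // assumed index name
        Type("profile").        // assumed type name (required in ES 5.x)
        BodyJson(Profile{Name: "example", Gender: "female", Age: 25}).
        Do(context.Background())
    if err != nil {
        panic(err)
    }
    fmt.Println("saved with id:", resp.Id)
}

Leaving out an explicit Id lets Elasticsearch generate one, which is convenient when the crawler cannot guarantee unique keys per item.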

For distributed crawling, follow the same setup steps and additionally:

  • Open a terminal and execute the command:
go run src/crawler/_distributed/persist/server/ItemSaver.go --port=1234
  • Open two additional terminals and run one of the following commands in each:
go run src/crawler/_distributed/worker/server/worker.go --port=9000
go run src/crawler/_distributed/worker/server/worker.go --port=9001
  • In another terminal, execute the command:
go run src/crawler/_distributed/main.go --itemsaver_host=":1234" --worker_hosts=":9000,:9001"
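
Under the hood, the workers and the item saver talk over RPC. The sketch below shows one plausible wiring for the ItemSaver service using Go's standard net/rpc with a JSON codec; the service name, method signature, and transport are assumptions for illustration, not necessarily what the project uses:

// itemsaver_rpc_sketch.go - an assumed RPC item-saver service.
package main

import (
    "fmt"
    "log"
    "net"
    "net/rpc"
    "net/rpc/jsonrpc"
)

// ItemSaverService receives items from workers and persists them
// (here it just logs; a real saver would write to Elasticsearch).
type ItemSaverService struct{}

func (ItemSaverService) Save(item map[string]interface{}, result *string) error {
    log.Println("saving item:", item)
    *result = "ok"
    return nil
}

func main() {
    rpc.Register(ItemSaverService{})
    listener, err := net.Listen("tcp", ":1234")
    if err != nil {
        panic(err)
    }
    fmt.Println("ItemSaver listening on :1234")
    for {
        conn, err := listener.Accept()
        if err != nil {
            log.Println("accept error:", err)
            continue
        }
        // Serve each worker connection on its own goroutine.
        go jsonrpc.ServeConn(conn)
    }
}

A worker would then connect with jsonrpc.Dial("tcp", ":1234") and send items via client.Call("ItemSaverService.Save", item, &result).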

Summary:

The Crawler-website is a web crawling tool written in Go. Its key features are Docker-based deployment, Elasticsearch storage, and an engine that scales from a singleton setup through concurrent to distributed crawling. Installation consists of fetching the Go dependencies, starting an Elasticsearch container, and running the appropriate entry point. With its simple query interface and efficient crawling engine, the Crawler-website is a practical tool for extracting and analyzing data from websites.
