02-09-2019, 10:29 AM
Raise Your Weapon
Join Date: Jun 2003
Location: Outback Australia
Posts: 15,601
Quote:
More precisely, I crawled 250,113,669 pages for just under 580 dollars in 39 hours and 25 minutes, using 20 Amazon EC2 machine instances.
I carried out this project because (among several other reasons) I wanted to understand what resources are required to crawl a small but non-trivial fraction of the web. In this post I describe some details of what I did. Of course, there’s nothing especially new: I wrote a vanilla (distributed) crawler, mostly to teach myself something about crawling and distributed computing.
Still, I learned some lessons that may be of interest to a few others, and so in this post I describe what I did. The post also mixes in some personal working notes, for my own future reference.
How to crawl a quarter billion webpages in 40 hours | DDI
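For a rough sense of scale, the figures in the quote can be turned into throughput and unit cost. This is only a back-of-the-envelope sketch: the variable names are mine, the 580 dollar figure is treated as an upper bound because the quote says "just under", and the per-machine number assumes all 20 instances ran for the whole crawl, which the quote does not state.

```python
# Figures copied from the quoted post; variable names are illustrative only.
PAGES = 250_113_669
COST_USD = 580                      # "just under 580 dollars", so an upper bound
ELAPSED_S = 39 * 3600 + 25 * 60     # 39 hours 25 minutes, in seconds
MACHINES = 20                       # assumed to all run for the full duration

pages_per_second = PAGES / ELAPSED_S
pages_per_second_per_machine = pages_per_second / MACHINES
usd_per_million_pages = COST_USD / (PAGES / 1_000_000)

print(f"{pages_per_second:.0f} pages/s overall")
print(f"{pages_per_second_per_machine:.0f} pages/s per machine")
print(f"at most ${usd_per_million_pages:.2f} per million pages")
```

So the crawl averaged on the order of 1,760 pages per second in total, or roughly 88 per machine, at a little over two dollars per million pages.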