The Ultimate Guide To Building A Golang Web Scraper
It’s simply impossible for businesses to function effectively without access to the information and insights provided by big data. As you likely already know, web scraping is the most efficient method of gathering data — regardless of your industry. Although many publications issue predictions and trend reports based on data they’ve collected and analyzed, there’s no substitute for collecting and analyzing your own data. Web scraping allows you to customize your data strategy to give you the information you need to make the best decisions for your business by prioritizing the data that’s most relevant and valuable to your company.
There are many ready-made web scrapers available, so you may be able to find one that suits your needs. However, building one yourself offers you the flexibility to create exactly what you need while skipping what you don’t. You can build a web scraper in any language, but Go — also known as Golang — offers some significant benefits that make it a great choice.
This guide is a complete resource for scraping data from websites with Go. You’ll find everything from the pros and cons of using Go for your web scraper to common problems and how to fix them. If you’re looking for something specific, feel free to use the table of contents to skip around.
The Benefits of Go
With so many available languages to choose from, you may be wondering why you should use Go to build a web scraper. There are certainly more popular languages you could use — Go doesn’t even crack the top 10 when it comes to most used coding languages. However, despite its little-sibling status compared to options such as C or Python, the functional minimalism of Go makes it appealing for web scraping.
What is Go?
Go began as sketches on a whiteboard at Google in 2007. Creators Robert Griesemer, Rob Pike, and Ken Thompson were frustrated with the programming languages available at the time. They felt other languages forced them to choose among efficient compilation, efficient execution, and ease of programming, so they combined the three and created Go.
Go combines the ease of a dynamically typed, interpreted language with the efficiency of a compiled, statically typed language.
Go was designed to be fast and easy to learn. As an aside, the language is officially named Go, but is often referred to as Golang because of its website, which is golang.org, chosen because go.org was already taken. According to its creators, “Go is an open-source programming language that makes it easy to build simple, reliable, and efficient software.”
Go consists of a few simple, orthogonal features that can be combined in a small number of ways. While this means not everyone’s favorite features are included, it also means Go is easier to learn and program with than more complex languages. Go was designed for efficient code, which translates into faster software and apps. Other benefits of Go include:
- Speed: Because Go is a compiled language, its code is directly translated into formats that processors understand — without the extra steps of bytecode and virtual machines.
- Concurrency: Concurrency is built into Go through goroutines, channels, and the select statement.
- Simplicity: Golang is easy to learn and easy to read. Its syntax is small compared to other languages, minimizing the need to constantly look up things.
Despite its benefits, Golang may not be the best choice if you’re more familiar and comfortable with other languages. Python, Java, and C# are also often used for web scraping. We’ll compare the advantages and disadvantages of each below so you can decide which option you prefer.
Go vs Python
Python is one of the most popular programming languages. It is an interpreted, dynamically typed language that’s relatively easy to learn and comes with an extensive collection of libraries. It can also be a good choice for building a web scraper, depending on your experience and objectives.
Go:
- Supports concurrency
- Easier to read
- Automatically generated documentation
- Garbage collector
- Static typing
- No exception-style error handling (errors are returned as values and checked explicitly)
- Less responsive community
- Generic functions only available since Go 1.18
Python:
- Primary language of data scientists
- Large support library
- Easier for beginners
- High performance
- Line-by-line execution
- No built-in concurrency mechanism
- Shows more errors at run time
Python is a general-purpose programming language that was first released in 1991. It’s widely used in data science projects and has an active, responsive community. The libraries and community available for Python can make things easier for you if you’re just getting started, so this language comes out on top in terms of libraries and resources.
Because Go was built to handle concurrency, it beats Python in terms of scalability. Because it’s statically typed, many errors are caught at compile time rather than at run time. As for syntax, Golang’s is smaller but not as beginner-friendly as Python’s. If you’re interested in web scraping with Python, read our complete guide here.
Go vs Java
Java is one of the most popular programming languages in use today. It was released in 1995, making it almost as old as Python. As you would expect from an older language, it has a lot of support and resources available for users. It’s a general-purpose programming language that’s object-oriented and uses classes.
Java source code is compiled to bytecode, which the Java Virtual Machine (JVM) then runs. Java’s large collection of libraries makes it easy to find pre-written code for many projects. Java and Golang are C-family languages, so they share similar syntax.
Go:
- Smaller library
- Code is compiled directly into binary file
- Easier to read
- More platform-dependent
- Reflection is less obvious and more complicated
- No ad-hoc polymorphism
Java:
- Bigger community
- Less platform-dependent
- Object-oriented nature enhances reusability
- Requires complex codes
- No backup facility
- Uses significant memory space
Java isn’t the king of programming languages like it used to be, but it’s still widely used all over the world. Java is more popular in data science applications at this time, although that may change in the future.
Go compares well with Java in many benchmarks, compiles faster, and tends to produce more compact code. Java’s threads are heavier than Go’s goroutines and use more memory. Unless you’re well-versed in Java programming, you’ll probably be better off using Go for your web scraping.
Go vs C#
C# is also rooted in the C-family of programming languages. It was developed by Microsoft in 2000. Though it started as a closed-source language, it is now open-sourced and cross-platform. Both C# and Go compile code ahead of running it, but C# programs need the .NET runtime to run, while Go produces standalone binaries. Both C# and Go generally perform much faster than interpreted languages such as Python.
Go:
- Easier automatic documentation
- Built-in support for unit testing
- Simpler to read and write
- Lower resource usage
- Less compatibility with 3rd-party tools
- Not as versatile
C#:
- Native Graphical User Interface solution
- More features
- Performs exceptionally well on Windows
- Large support community
- Takes longer to learn
C# is similar to Golang in many ways. They both have the performance benefits that come from not having to be interpreted. However, C# is much clunkier and more difficult to learn than Go. If you’re already familiar with C# and have the resources it needs to run effectively, it may be a good choice. Otherwise Go is probably a simpler solution for building a web scraper. If you do decide to go with C#, you can find our complete guide to web scraping with it here.
Web Scraping with Golang
All web scrapers work by reading the HTML of a website. They then collect the specific data that you’ve asked for and export it into a readable format. If you’re building your own web scraper to customize your data mining solution, you need to begin with a comprehensive data strategy.
Start by deciding exactly what your goals are for your web scraping project. The options are almost limitless, so it helps to focus.
Some things to consider include:
- Will you be scraping for product comparisons such as price or features?
- Do you want to scrape as a way of finding out how to reach potential customers?
Maybe you’re more interested in monitoring your brand mentions and customer sentiment. Although you will probably want to do all of these at some point, it’s a good idea to start with one focused project.
As you delve deeper into web scraping, you’ll be able to fine-tune your web scraping to collect the data that’s most valuable to you at any given time. Some data scraping projects are likely to be ongoing, such as monitoring your competitor’s prices. Meanwhile, others may relate to short-term objectives, such as determining the focus of your next advertising campaign.
Before you get started
Once you’ve narrowed down exactly what data you want to scrape, it’s a good idea to get familiar with the website you’ll be starting with.
Begin by opening the webpage in the browser you’ll be using with your web scraper. We’ll be using Chrome in the examples for the rest of this article. Turn on the developer tools so that you can see the HTML structure of the website. You’ll need to know where the data you’re interested in is located in the HTML structure of the webpage. When you’re familiar with that, you’ll have an easier time building your web scraper.
Building a Golang Web Crawler
The minimalism of Go makes it one of the simplest languages in which to program a web scraper. Go’s secret weapon is Colly, a “fast and elegant scraping framework for gophers.” “Gophers” are Golang aficionados. Colly is a “batteries-included” solution for web scraping. It comes with all the tools you need out of the box. It’s a free, open-source solution for data collection.
Colly also comes with built-in options that support web crawling best practices for being a “good neighbor,” such as rate-limiting, parallel crawling, and respecting robots.txt.
Colly is a Go package designed for extracting structured data from websites. The “collector” manages data extraction in Colly. The collector is configurable, which allows you to modify and limit certain aspects of your program. It is also what makes the actual requests to the website. Collectors can have callbacks attached to them that execute at different times.
Here are some of the elements you’ll be using frequently with Colly:
OnHTML lets you target a specific HTML identifier. This is where your earlier research will pay off. You’ll need to know where your data is stored in the HTML structure of the page you’re scraping. Using this, you’ll create a struct that represents the data you’re collecting. Your OnHTML will have two parameters. The query parameter will tell Colly what to look for, and the callback function will take a Colly HTMLElement. OnHTML will run every time it encounters a value that matches the query parameter.
HTMLElement contains the matching HTML data that your scraper found. It is the value passed to your OnHTML callback.
As its name implies, GoQuery is modeled on jQuery. GoQuery allows more complicated scraping than HTMLElement, such as finding the siblings and parent of the anchor element. If you’re used to scraping with Python and Beautiful Soup, you’ll probably feel at home with GoQuery. However, most scrapers can be built without GoQuery, since most tasks can also be accomplished by attaching an OnHTML callback to the “html” selector, which gives you access to the entire page.
If you’re familiar with programming in Go, it’s simple to create a web scraper. Even if you’re new to Go, it’s not that hard, although you’ll need to familiarize yourself with it a bit more to make sure you’re comfortable and can achieve your goals.
Before you can start programming, you’ll need to install Go. Luckily, it’s a lightweight language so it won’t take up much room. You can use any text editor to program in Go. Create a folder for your web scraper and create a “main.go” file to serve as the starting point of your scraper. We’ll be using Go modules to run Colly. Go modules are the official dependency management solution for Go.
Install Colly by running the following command:
go get github.com/gocolly/colly
Coding your Go web scraper
Every website is different, so you’ll need to configure your scraper based on how the site you’re scraping is set up and what data you want to extract. For a basic scraper, the following steps will get you started:
- In your main.go file, declare package main and a func main.
- Create your collector with the command: c := colly.NewCollector()
- Add your OnHTML callback function and query parameter based on HTML attributes of the data you want to scrape.
- Set any other callbacks you want, including OnError, OnResponse, OnRequest, etc. Regardless of where they appear in your code, callbacks run in a fixed order that Colly’s documentation lists.
- Call on the “Visit” function to tell your scraper what website to scrape.
Once your scraper is ready to go, you’ll want to export your data to a readable format. You can do that with Go using the JSON package and the following steps:
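- Import the encoding/json package, the ioutil package, and the os package. (In Go 1.16 and later, os.WriteFile replaces ioutil.WriteFile, so the os package alone is enough.)
- Create a function called “writeJSON” that takes your data as a parameter and serializes it with json.MarshalIndent.
- Use WriteFile to write the result to the file you specify, with a permission mode of 0644. WriteFile creates the file if it doesn’t already exist.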
Use Cases for Scraping Web Pages with Go
Once you’ve got your scraper up and running, there’s no end to its uses. You likely already have a goal in mind with web scraping.
Here are some additional ideas to get you started:
Keep tabs on what’s being said about your brand across the internet and on social media. With so many customers relying on reviews before they make a purchase, knowing how your products are perceived is vital to making sales. Monitoring your brand allows you to understand what your customers love about your products as well as what you can improve.
Brand monitoring also helps you understand how your company is handling customer service. Customer service can make or break your business and tracking customer sentiment gives you the opportunity to deal with small issues before they go viral for all the wrong reasons.
Research your competition
One common use of web scraping is to monitor your competition. By finding out your competitor’s offerings and prices, you can make informed decisions about pricing as well as products and features.
You might not necessarily want to lower your prices if you find out your competitor is charging less, but you may want to highlight the features you offer that they don’t.
Understanding where your ideal customer spends their time and money can help you tailor your marketing strategy to reach them. You can also find out what their priorities are and how to talk to them in a way that echoes those priorities. Scraping social media sites and web forums with keywords that are relevant to your avatar is a great way to do this. This type of fine-grained research will give you the data you need to reflect your customers’ language back to them.
Another aspect of marketing research that’s easily accomplished with web scraping is new product research. By scraping reviews of your products and keywords related to ideas you’re developing, you can find out exactly what your customers want and provide it to them. You won’t have to wonder if there’s a market for your product design team’s latest idea, and you’ll be able to know if it’s feasible based on extensive data collection.
Search engine optimization (SEO)
An optimal search engine optimization (SEO) strategy is crucial if you want to rank high on search engines. By monitoring trends and using the data to extract relevant keywords, you can target your content creation efforts to rank high for specific keywords relevant to your industry.
If you’re in the travel industry or another type of industry that deals with aggregated content, you can use web scraping to pull together data from many different sources and offer it all in one place for your customers.
You might want to curate a travel itinerary that includes door-to-door booking options. Businesses in the music industry may want to pull together themed playlists for listeners. Web scraping allows you to quickly gather data from many different sites, a task that would be tedious and time-consuming to do manually.
Ethical Web Scraping with Golang
The biggest benefit to web scraping is that it processes data far faster than any human could because you’re using automated software, or bots, to perform tasks. This is also the reason it’s important to be a good digital citizen when you’re scraping.
Your scraper makes requests of the servers it visits, and the website you’re scraping must allocate resources to deal with those requests. Too many requests can crash a server, which is why most websites have implemented anti-scraping measures to deal with bots.
Another reason so many websites have anti-bot policies is to thwart malicious actors. Whether it’s copying a website wholesale or scraping for private data, some web scrapers have nefarious intent. This is not only harmful to people who have their private data stolen but can also lead to hefty fines for corporations involved in data breaches. It makes sense for websites to take measures to protect their data.
However, you obviously are scraping data with pure intent to boost your business, so you’ll make sure your data scraping efforts are above-board. Go makes it easy to ethically scrape data by offering tools specifically for that purpose.
Use the following guidelines to make sure you aren’t inadvertently harming the websites you visit:
Only scrape publicly available information
This goes without saying, but you should only attempt to scrape the type of information you could find if you were to manually visit a website. Not only is scraping private data bad ethics, it may also be illegal, depending on the data and the jurisdiction.
Check public APIs first
Accessing an Application Programming Interface (API) isn’t the same as scraping a website, but it is a way to gather public data from a website. APIs give you direct access to data. Many large websites offer APIs, either free or paid, to allow other developers to access their data sets and use them for whatever purpose they choose, including creating their own software, research, etc.
API requests aren’t as overwhelming for a website as scraping requests. There’s a dedicated method for extracting the data, and it’s done with a single request. Not all websites offer APIs. Even among those that do, the data you’re interested in may not be available via the API. However, it’s always a good idea to check the API before you start scraping.
Limit your speed
It may seem counterintuitive to limit the speed of your web scraper, particularly since one of the main benefits of Go is that it’s faster than other options. However, you don’t want to crash a server by sending too many requests at once.
Even if you don’t crash the server, making too many requests too quickly will get you banned. You can use the “Limit” method in Colly to set a random delay that will slow down your scraper’s requests. You can also use this method to limit how many requests will be made at the same time:
- RandomDelay allows you to set a random delay between each request that won’t exceed the time you set.
- Parallelism allows you to determine the maximum number of requests that are carried out at the same time.
Follow the rules
Many websites have a robots.txt file that tells crawlers how the site wants to be handled. These files follow the Robots Exclusion Protocol (standardized as RFC 9309) and are human-readable. They identify which parts of the website are available for scraping and which aren’t, and some sites add nonstandard directives such as a crawl delay.
Check the robots.txt file for every website you plan to scrape and respect their rules. One of the environment variables you can use to configure your collector is “COLLY_IGNORE_ROBOTSTXT.” Make sure this is set to “no” if you’re setting up your collector this way.
Scrape during down times
Most sites have cyclical quiet periods, often when the people who normally use them are sleeping. By scraping during early morning or late evening hours, you can help avoid overwhelming the server. Avoid scraping during particularly busy times when the site is likely to be slammed with human users.
Only scrape for data you need
While you can set up a scraper to collect all the available data from a website, it’s more considerate to only scrape the information you need and will use. Just because you can get more data doesn’t mean you should. Sending thousands of requests to extract information you won’t even use ties up the website’s server needlessly. It also makes analyzing data into usable insights more cumbersome on your end if you’re having to wade through piles of irrelevant information.
How to Avoid Common Problems When Using Your Go Scraper
If you’ve done any programming before, you’ve undoubtedly encountered times when you thought everything was set up perfectly — only to have your program return something completely unexpected or just sit there doing nothing. The same thing happens with web scraping.
While it’s easy to build a simple scraper, the many different variables that go into scraping make it a complicated process. Here are some of the most common issues you’re likely to face when scraping and what you can do to fix or avoid them:
A honeypot is a trap designed to catch robots. It’s a link in a website’s code that is only visible to bots. To your scraper, it looks like any other page to scrape but following it will alert the website that your scraper is a bot and should be banned. One way you can avoid honeypots is by searching for them when you’re examining the website before you scrape it.
Honeypots may have a link with a “nofollow” attribute, or they may be camouflaged in the same color as the background. One WordPress plug-in creates a “black hole” directory for honeypot links. Finding and avoiding these links when setting up your scraper can help you dodge honeypots.
A Completely Automated Public Turing test to tell Computers and Humans Apart (CAPTCHA) is a standard security measure to detect robots. If you’ve ever had to identify the number of traffic signs in a blurry, gridded picture, you’ve encountered a CAPTCHA. Unfortunately, there’s no way to program your web scraper to deal with CAPTCHAs. However, there are steps you can take to avoid encountering them in the first place.
They’re often triggered by bot-like activity, so the best way to avoid them is to make sure your bot behaves as humanly as possible. This can be accomplished by using Colly’s limit method and using proxies. Proxies are an important part of successful web scraping, and we’ll go into them fully in another section.
User agent blocks
A user agent string gives a website information about:
- What device you’re using to access their site and information about what browser you’re using
- What version you’re using
- What your operating system is
Websites use this information to responsively deliver content to your device. However, they also use it to filter out bots.
The default user agent string used by many web crawlers identifies them as bots. There’s no point in taking the time to implement human-like measures in your web crawler if your user agent string is simply going to say, “Hi, I’m a bot,” as soon as you make a request.
Colly ships with an extension that sets a random browser user agent on every request. It lives in Colly’s extensions package and has the following signature:
func RandomUserAgent(c *colly.Collector)
As with user agents, your internet protocol (IP) address tells a website a lot about you. In addition to your general location, your IP address identifies your device, at least as long as you’re in the same location. Perhaps the most common anti-bot strategy websites use is blocking an IP address when it issues too many requests too quickly. Once your IP address has been blocked, you can no longer access the website.
The best method for avoiding IP bans is using proxies, which swap out your IP address to avoid looking like all your scraping comes from the same location. We’ll soon get into more detail about which proxies are the best to use while web scraping.
Getting the Most Out of Your Golang Scraper
Now that you know some of the common problems you’re likely to run into when scraping with Go, we’ll discuss some ways to make scraping with Go as fast and efficient as possible.
Although a scraper will scrape one page at a time much faster than a human can, you’re missing out on the power of this tool if you’re limiting it to a single page at a time.
Most websites indicate the page number via the URL. You can use a “for” loop to scrape multiple pages by setting a loop variable that begins at 1 and continues while it is less than or equal to the last page number. You can then use the “fmt.Sprintf” function to insert the loop variable into the URL that you pass to Visit.
By default, Colly blocks until each request finishes. When you create your collector with NewCollector, you can turn this behavior off by passing “colly.Async(true).” This allows Colly to send a new request before the last one ends, which greatly speeds up your process. With async enabled, call “c.Wait()” so the program waits until all concurrent requests have finished.
Mimic human behavior
Creating a web scraper is a balancing process that involves making it work as quickly as possible to extract the data you want while also trying to mimic human behavior as much as possible.
We’ve already touched on two ways to do this in Colly using the limit method. Programming in a delay and limiting the number of parallel requests will slow your scraper down a bit, but if a site detects your scraper as a bot, it will come to a screeching halt and you’ll be blocked. Ultimately, you may give up some speed, but you’ll increase your efficiency by avoiding bans.
A proxy IP address hides your real IP address from the website so that it can’t tell you’re sending thousands of requests at once. A proxy acts as a go-between so you can send your request to the proxy, which then sends the request on to the website.
Of course, you can’t just substitute one static IP address for your current IP address, or it will get banned just as quickly. You’ll need to use a rotating pool of IP addresses. Colly makes it easy to use proxies with its proxy package. Its RoundRobinProxySwitcher function cycles through your list of proxies, sending each request through the next one in turn.
Choosing the Best Golang Proxy
As we touched on above, you’ll need a rotating pool of proxies to be able to effectively use your web scraper. If you’re aware there are free proxies available on the internet, you may be thinking that will be an easy solution. Unfortunately, it’s not that simple. As with other aspects of web scraping, proxies can be complicated, so we’ll talk about the different types and which use cases they’re best for in this section.
Like most things in life, you get what you pay for when it comes to proxies and web scraping. Free proxies are completely unsuitable for web scraping for several reasons.
The most important reason is that they aren’t secure. Free proxies have been known to be used by hackers to access your data. In addition to the security issues, free proxies are publicly available so it’s impossible to know how many users are accessing them at once.
This overload causes them to be slow and perform poorly. You also have no control over who else is using the same proxy IP address as you, so you’re likely to get banned at any time by someone else’s bad behavior. This puts you at a potential roadblock for being unable to access the sites and data you need.
Data center proxies
Data center proxies originate in data centers and are the most common type of proxy. Their main benefits are that they’re fast, economical, and plentiful. They hide your IP address, but they’re easily identified as data center proxies, so they’re more likely to get banned.
Data center proxies aren’t ideal for web scraping but may be a good option if your budget is limited. Because data center proxies are more likely to get banned, it’s important to make sure your proxy provider offers a lot of subnet diversity. Some websites will ban an entire subnet if they detect bot-like activity from a data center IP address.
Blazing SEO offers seven unique autonomous system numbers (ASNs) and 20,000 unique C-class subnets for data center proxies that are available from 29 countries. We have over 300,000 data center IP addresses and offer unlimited bandwidth and free replacements.
Rotating residential proxies
When it comes to web scraping, rotating residential proxies are the gold standard. Residential proxies are issued by an internet service provider (ISP) and linked to a specific address. They look exactly like a normal user’s IP — because they are.
Residential proxies are much less likely to get banned than data center proxies. And if one of your IPs does happen to get banned, you’re using a pool of them — so you can just swap in another and keep on scraping. Residential proxies are more expensive than data center proxies but result in more effective and efficient web scraping since you’ll have less downtime. Large-scale scraping projects will experience the most success with residential proxies.
Blazing SEO’s residential proxies are the highest quality solutions for your web scraping needs, providing the authority, reliability, and ethical practices that protect you from bans. Learn more about how we can help you today.
You’ve learned a lot about scraping in Golang, including why it’s a great option, what you need to know before you start scraping, the basic outline to follow in creating a Go web scraper, and how to avoid the most common pitfalls when you start web scraping. The next step is to choose Blazing SEO residential proxies to help implement your full-scale enterprise data strategy.
Blazing SEO’s world-class customer support team is standing by to help provide the solutions you need. Reach out to get started today and find out how far we’re willing to go to earn your business.
The information contained within this article, including information posted by official staff, guest-submitted material, message board postings, or other third-party material is presented solely for the purposes of education and furtherance of the knowledge of the reader. All trademarks used in this publication are hereby acknowledged as the property of their respective owners.