Scrape Amazon Reviews: eCommerce Insights And Smart Data Collection

The expanding e-commerce industry is hungry for more complex analytical techniques to help study customer sentiment, predict market trends, and get a solid edge over a never-ending list of competitors. Yet to get the most out of whatever research methods you use, you invariably need high-quality data you can rely on.

Amazon is the undisputed champion when it comes to selling and buying online. That’s why it’s an excellent source of alternative data and analytics for retailers. Business owners scrape data from Amazon all the time to compare prices, product descriptions, and customer reviews.

Scraping Amazon reviews is incredibly handy when performing sentiment analysis to find new market opportunities. A properly done Amazon review analysis gives insightful information that entrepreneurs and established business owners can use to create wiser strategies that advance their service and revenue.

In a time when scraping the web is easier than ever, business owners should take advantage of how quickly they can gather all sorts of information. Currently, there are dozens of Amazon reviews scraping tools available online. If you’d like a more hands-on approach, programming your own in your preferred coding language is always an option. Read on to learn how to scrape Amazon reviews like a pro in a few simple steps and find out the primary benefits of collecting this type of information.

Rotating Residential Proxies: Best Proxies for Scraping Amazon Reviews


Scraping Amazon reviews is becoming more common by the day. However, high-volume data extraction is still frowned upon by many sites and could make the online retail giant mistake you for a malicious actor. Using a proxy — preferably a rotating residential one — is an excellent way to avoid being caught mid-research and being blacklisted as you scrape Amazon reviews left and right.

A rotating proxy assigns a new IP address from a proxy pool for every connection made. This means you can send thousands of requests at the same time and each of them would have a different IP address — making your connection requests inconspicuous to Amazon. IP rotation takes your Amazon review analysis to the next level, and it will protect you from subnet bans.

Unlike data center proxies, residential proxies are associated with an internet service provider. This sourcing method improves the proxies’ efficiency by making it seem as though you’re using connections that come from individual residences.
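
For illustration, here’s a minimal sketch of what proxy-enabled scraping can look like with Python’s requests library. The gateway host, port, and credentials below are placeholders — swap in your provider’s actual details:

import requests

# Hypothetical rotating-residential gateway; replace the host, port, and
# credentials with the ones your proxy provider actually gives you.
PROXY = "http://username:password@gateway.example-provider.com:8000"
proxies = {"http": PROXY, "https": PROXY}

# If the gateway rotates IPs per connection, each request below exits
# through a different residential address.
for page in range(1, 4):
    url = f"https://www.amazon.com/product-reviews/B01IO1VPYG/?pageNumber={page}"
    response = requests.get(url, proxies=proxies, timeout=30)
    print(page, response.status_code)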

How To Scrape Amazon Reviews


Before you learn how to grab product review data from Amazon, you must have a holistic idea of what you’ll be working with.

Here are the different stages involved in the Amazon review data collection process:

1. Analyze the site’s HTML structure

Scraping requires finding a solid pattern in the site you’re gathering data from and extracting it. Before you start coding your Amazon reviews scraping tool, you need to understand the HTML structure of the review page and identify patterns regarding IDs, usage of classes, or any other HTML element that repeatedly appears in the site’s code.
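
If you’d like to surface those patterns programmatically, here’s a short sketch — assuming the requests and BeautifulSoup packages are installed — that tallies the most frequent class names on a review page. Classes that repeat many times are usually the review containers and fields you’ll want to target:

import requests
from collections import Counter
from bs4 import BeautifulSoup

# Fetch a review page and count how often each class name appears.
html = requests.get(
    "https://www.amazon.com/product-reviews/B01IO1VPYG/",
    headers={"User-Agent": "Mozilla/5.0"},
    timeout=30,
).text

soup = BeautifulSoup(html, "html.parser")
classes = Counter(
    cls for tag in soup.find_all(class_=True) for cls in tag["class"]
)

# Heavily repeated classes usually mark the review list and its fields.
for name, count in classes.most_common(15):
    print(f"{count:4d}  .{name}")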

2. Implement syntactic analysis of the code

Once you’ve analyzed the site’s structure, you must start working on the code you’ll be implementing. This means visiting Amazon and using a parser to break the data into smaller elements that can easily be translated into the programming language of your choice.

3. Collect and store information

The sequential tokens, interactive commands, and other elements obtained from the parser can easily be extracted in your preferred format. You could use CSV or JSON as the final output for your scraped Amazon reviews to reside in.
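
As a minimal illustration — with a couple of made-up records standing in for real parser output — here’s how Python’s built-in csv and json modules can store the final results:

import csv
import json

# Example records, standing in for the output of the parsing stage.
rows = [
    {"author": "A. Reviewer", "rating": "5.0", "comment": "Great drone!"},
    {"author": "B. Reviewer", "rating": "2.0", "comment": "Battery died fast."},
]

# CSV output
with open("reviews.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["author", "rating", "comment"])
    writer.writeheader()
    writer.writerows(rows)

# JSON output
with open("reviews.json", "w", encoding="utf-8") as f:
    json.dump(rows, f, indent=2)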

Best Amazon reviews scraping tools available

If coding is not your cup of tea — or you’d much rather invest your time in other business endeavors — you can always resort to using Amazon reviews scraping tools that will do the hard work for you, faster than you can Google “how to download Amazon reviews.” This software can take the form of browser extensions or dedicated Amazon reviews scraping programs.

Browser extensions

These add-ons are readily accessible for pretty much anyone with an internet connection. They typically have basic functions that are sufficient for some casual Amazon reviews scraping activity. This could be ideal for small businesses that don’t need high-volume data extraction — but might not be the best option for larger-scale companies.

Scraping software

Using dedicated scraping software is the way to go for those in need of scraping Amazon regularly. These frameworks help prevent the most common nuisances that come with trying to gather large amounts of data — think captchas, login walls, IP bans, pagination, etc.

Building an Amazon review scraping tool: What programming language should you use?

If you’re looking for a more hands-on approach that allows you to personalize your Amazon reviews data analysis, you could build your own scraping tool. But what language should you use? The best programming language is the one you’re already familiar with, especially when web scraping. Having previous experience in programming will make it easier for you to find pre-built resources that support scraping in your language of choice. It will also speed up the process as you won’t have to go through the learning curve.

Luckily, web scraping is not a task that requires you to start everything from scratch. There are numerous third-party libraries dedicated to web crawling that will help you ease the coding work. Keep in mind that regardless of how much of an experienced coder you are, web scraping still involves a variety of problems you’ll need to bypass with the coding language and framework you pick.

When selecting the ideal programming language to scrape your Amazon reviews data with, make sure it’s flexible and easy to code with. Some other elements to look for include:

  • Ability to feed a database
  • Scalability
  • Maintainability
  • Crawling efficiency

Try not to overthink the role your programming language will play in the overall speed of your Amazon reviews scraping activity. The main factor that affects how fast you can gather data is network I/O (input/output) — after all, web scraping is about sending connection requests and waiting for responses.

Here are some of the most commonly used programming languages for web scraping:

Python

This all-rounder programming language is an old-time favorite for web scrapers. Paired with a good framework, it can make the whole crawling process a lot smoother. Make sure the one you pick has useful features, like the ability to convert incoming documents to Unicode and outgoing documents to UTF-8.

Node.js

This language is perfect for crawling sites with dynamic code practices. Node.js supports distributed crawling, but its weak communication stability makes it hard to work with on large-scale projects.

C and C++

These programming languages offer great overall performance. However, they’re costly to keep up with — which makes them less viable if you don’t manage a company that exclusively focuses on data analytics, research, and web scraping.

Step-by-step instructions for building an Amazon review analysis tool using Python

Python is the undisputed champion when it comes to finding the most suitable programming language for an Amazon reviews scraper. To make things easier when creating your own web crawler using Python, you’ll need to download and install a solid framework. In this example, we’ll use Scrapy, which is open-source and specifically targets web scraping.

There are two ways of installing this framework, but first, make sure you already have Python all ready to go. Here’s how you can download and install this programming language:

  1. Go to Python.org and download the latest version of Python
  2. Run the installer. Make sure pip — a package management system used to install and manage Python-based software — is selected as an optional feature, which typically happens by default, but it doesn’t hurt to double-check. 
  3. Select “Add Python 3.6 to PATH” to ensure the Python environment variables are added to your PATH and that Python and pip are accessible via PowerShell or Command Prompt. (This step is crucial for running scripts from the command line with “python script.py”.)
  4. Disable the PATH length limit.
  5. Close the window.
  6. Open PowerShell or Command Prompt.
  7. Type “python --version” and press enter.
  8. The version of Python should appear below to verify successful installation.
  9. Type “pip -V” to verify pip was successfully installed.

Installing Python packages for web scraping

To install Python packages with pip, all you need to do is open PowerShell or Command Prompt and type:

pip install <pypi package name>

The most popular ones are:

  1. BeautifulSoup

This framework allows you to pull data out of HTML and XML files. It works with your preferred parser to provide idiomatic ways of navigating, searching, and modifying the parse tree. To install BeautifulSoup, type in this command:

pip install BeautifulSoup4
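
For instance, here’s a tiny, self-contained example of BeautifulSoup pulling review text out of an HTML snippet (the markup below is a simplified stand-in for Amazon’s actual review page):

from bs4 import BeautifulSoup

html = """
<div id="cm_cr-review_list">
  <span class="a-profile-name">Jane D.</span>
  <span class="review-text">Flies well, but battery life is short.</span>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
for review in soup.select("#cm_cr-review_list .review-text"):
    print(review.get_text(strip=True))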

  2. LXML

This easy-to-use package has the most complete feature set for processing XML and HTML in Python. You can use it to parse HTML content downloaded from web pages. Once converted into a tree-like structure, the content can be navigated using semi-structured query languages — think XPath or CSS Selectors. To install it, you need this command:

pip install lxml
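
Here’s a brief sketch of lxml parsing a simplified review snippet and querying it with XPath; the cssselect package (installed separately) adds CSS Selector support:

from lxml import html

doc = html.fromstring("""
<div class="review">
  <span class="review-rating">5.0 out of 5 stars</span>
  <span class="review-text">Great for beginners.</span>
</div>
""")

# Query the parse tree with XPath
for rating in doc.xpath('//span[@class="review-rating"]/text()'):
    print(rating)

# The same tree queried with a CSS Selector (requires: pip install cssselect)
for node in doc.cssselect("span.review-text"):
    print(node.text_content())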

  3. Requests

This framework is basically HTTP for Humans. Python has its own HTTP libraries, but Requests minimizes the manual labor involved in using them. It lets you automate the process of sending HTTP/1.1 requests, which means you’ll no longer need to manually append query strings to your URLs. You can install this package using:

pip install requests
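
A quick illustration — Requests builds the query string from a params dictionary, so you never have to concatenate “?pageNumber=2” onto your URLs by hand:

import requests

response = requests.get(
    "https://www.amazon.com/product-reviews/B01IO1VPYG/",
    params={"pageNumber": 2, "sortBy": "recent"},
    headers={"User-Agent": "Mozilla/5.0"},
    timeout=30,
)
print(response.status_code)
print(response.url)  # final URL with the query string appended automatically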

Building an Amazon review analysis tool using Python

Next, install Scrapy using pip — the package manager you verified earlier:

$ pip install Scrapy

Or with Anaconda, you can use Conda to install it:

conda install -c conda-forge Scrapy

Once Scrapy is good to go, create a folder that will contain your application and run this command inside of it to create a new project:

scrapy startproject Scrape_AmazonReviews

Next, open the folder as a workspace in your preferred editor and create a spider. This will be the program that does the actual scraping and crawls through the Amazon reviews page’s URL to parse relevant data using XPath. This is the command to generate a new spider:

scrapy genspider spiderName your-amazon-link-here

After you create the spider, take a look at the folder structure and supporting files. Once your spider’s ready, it should look like this:

├── scrapy.cfg           # deploy configuration file
└── Scrape_AmazonReviews # newly created project’s Python module
    ├── __init__.py
    ├── items.py         # project items definition file
    ├── middlewares.py   # project middlewares file
    ├── pipelines.py     # project pipeline file
    ├── settings.py      # project settings file
    └── spiders          # a directory where spiders are located
        ├── __init__.py
        └── example.py   # spider we just created

You’ll also have the basic skeleton of the spider ready for you to start coding. It will look like this:


# -*- coding: utf-8 -*-
import scrapy

class AmazonReviewsSpider(scrapy.Spider):
    name = "amazon_reviews"
    allowed_domains = ["amazon.com"]
    start_urls = (
        "https://www.amazon.com/product-reviews/B01IO1VPYG/ref=cm_cr_arp_d_viewpnt_lft?pageNumber=",
    )

    def parse(self, response):
        pass

Before coding the spider, you need to identify the patterns you want to extract from the site. In this case, you’ll look for the HTML patterns of the Amazon reviews page and then inspect ratings, comments, and reviews. Once that step is completed, you’ll extract the data classes that are used to showcase the reviews’ details. You’ll need to scroll through the pages to obtain them.

At this point, you’ll be ready to set the base URL and add the number of pages you want to crawl. Then you can use the classes as identifiers. This is the code to extract that data:

# -*- coding: utf-8 -*-

# Importing the Scrapy library
import scrapy

# Creating a new class to implement the Spider
class AmazonReviewsSpider(scrapy.Spider):

    # Spider name
    name = 'amazon_reviews'

    # Domain names to scrape
    allowed_domains = ['amazon.com']

    # Base URL for the World Tech Toys Elite Mini Orion Spy Drone
    myBaseUrl = "https://www.amazon.com/product-reviews/B01IO1VPYG/ref=cm_cr_arp_d_viewopt_sr?pageNumber="
    start_urls = []

    # Creating a list of URLs to scrape by appending the page number to the end of the base URL
    for i in range(1, 5):
        start_urls.append(myBaseUrl + str(i))

    # Defining a Scrapy parser
    def parse(self, response):
        # Get the review list
        data = response.css('#cm_cr-review_list')

        # Get the name
        name = data.css('.a-profile-name')

        # Get the review title
        title = data.css('.review-title')

        # Get the ratings
        star_rating = data.css('.review-rating')

        # Get the users' comments
        comments = data.css('.review-text')
        count = 0

        # Combining the results
        for review in star_rating:
            yield {'Name': ''.join(name[count].xpath(".//text()").extract()),
                   'Title': ''.join(title[count].xpath(".//text()").extract()),
                   'Rating': ''.join(review.xpath('.//text()').extract()),
                   'Comment': ''.join(comments[count].xpath(".//text()").extract())
                   }
            count = count + 1
Once your spider is successfully built, you’re ready to save the extracted output using the runspider command. This will take the output of your spider and store it in a CSV, XML, or JSON file. To select your preferred format, use “-t” like in the example below:

scrapy runspider spiders/filename.py -t json -o - > amazonreviews.json

To extract the output into a CSV file, open Anaconda and run the following command:

scrapy runspider spiders/AmazonReview.py -o output.csv

Using other frameworks involves a similar process. Here’s another example using Selectorlib. You can install it, along with its companion packages, using pip:

pip3 install python-dateutil lxml requests selectorlib

The following steps come from a tutorial by Scrapehero. Create a file called “reviews.py” and paste this Python code into it:

from selectorlib import Extractor
import requests
import json
from time import sleep
import csv
from dateutil import parser as dateparser

# Create an Extractor by reading from the YAML file
e = Extractor.from_yaml_file('selectors.yml')

def scrape(url):
    headers = {
        'authority': 'www.amazon.com',
        'pragma': 'no-cache',
        'cache-control': 'no-cache',
        'dnt': '1',
        'upgrade-insecure-requests': '1',
        'user-agent': 'Mozilla/5.0 (X11; CrOS x86_64 8172.45.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.64 Safari/537.36',
        'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
        'sec-fetch-site': 'none',
        'sec-fetch-mode': 'navigate',
        'sec-fetch-dest': 'document',
        'accept-language': 'en-GB,en-US;q=0.9,en;q=0.8',
    }

    # Download the page using requests
    print("Downloading %s" % url)
    r = requests.get(url, headers=headers)

    # Simple check to see if the page was blocked (usually a 503)
    if r.status_code > 500:
        if "To discuss automated access to Amazon data please contact" in r.text:
            print("Page %s was blocked by Amazon. Please try using better proxies\n" % url)
        else:
            print("Page %s must have been blocked by Amazon as the status code was %d" % (url, r.status_code))
        return None

    # Pass the HTML of the page to the Extractor and return the parsed data
    return e.extract(r.text)

with open("urls.txt", 'r') as urllist, open('data.csv', 'w') as outfile:
    writer = csv.DictWriter(outfile, fieldnames=["title", "content", "date", "variant", "images", "verified", "author", "rating", "product", "url"], quoting=csv.QUOTE_ALL)
    writer.writeheader()
    for url in urllist.readlines():
        data = scrape(url)
        if data:
            for r in data['reviews']:
                r["product"] = data["product_title"]
                r['url'] = url
                if 'verified' in r:
                    if 'Verified Purchase' in r['verified']:
                        r['verified'] = 'Yes'
                    else:
                        r['verified'] = 'No'
                r['rating'] = r['rating'].split(' out of')[0]
                date_posted = r['date'].split('on ')[-1]
                if r['images']:
                    r['images'] = "\n".join(r['images'])
                r['date'] = dateparser.parse(date_posted).strftime('%d %b %Y')
                writer.writerow(r)
            # sleep(5)

Your Amazon product review scraper should be able to:

  1. Read a list of product review pages from a file called urls.txt.
  2. Use a YAML file from Selectorlib to identify relevant data on an Amazon page.
  3. Save the YAML data in a file called selectors.yml.
  4. Scrape the data.
  5. Save the output data as a CSV spreadsheet.

Once you have the template ready, you can move on to using the Selectorlib Chrome extension. Click on “Highlight” to preview your selectors, and use the “Export” button to download the YAML file.

The template in selectors.yml should look like this:

product_title:
    css: 'h1 a[data-hook="product-link"]'
    type: Text
reviews:
    css: 'div.review div.a-section.celwidget'
    multiple: true
    type: Text
    children:
        title:
            css: a.review-title
            type: Text
        content:
            css: 'div.a-row.review-data span.review-text'
            type: Text
        date:
            css: span.a-size-base.a-color-secondary
            type: Text
        variant:
            css: 'a.a-size-mini'
            type: Text
        images:
            css: img.review-image-tile
            multiple: true
            type: Attribute
            attribute: src
        verified:
            css: 'span[data-hook="avp-badge"]'
            type: Text
        author:
            css: span.a-profile-name
            type: Text
        rating:
            css: 'div.a-row:nth-of-type(2) > a.a-link-normal:nth-of-type(1)'
            type: Attribute
            attribute: title
next_page:
    css: 'li.a-last a'
    type: Link

Now all you need to do is add the URLs you’re scraping into a text file called urls.txt in the same folder. To run the scraper, use the following command:

python3 reviews.py

Best Practices to Scrape Amazon Reviews Successfully


Web scraping is generally frowned upon by most sites. Keep in mind that they might have a hard time differentiating hackers from genuine researchers like you. To minimize the risk of getting banned from the site, always follow the best practices and tactics to make your Amazon review scraping endeavors run as smoothly as possible.

When attempting to scrape Amazon reviews data, you will likely encounter some of the most common challenges that all web scrapers face. Amazon discourages any scraping activity — both in policy and in site structure. The company strives to protect its data at all costs, and the anti-scraping measures it has implemented might give your scraper a hard time extracting the information you need.

Amazon detects bots and blocks their IPs

The e-commerce giant closely monitors the behavior of browsing agents and can easily identify your scraper bot’s actions when crawling through a browser. For example, if your URLs are changed at a regular interval by a query parameter, you’ll give yourself away. Amazon will then try to use captchas and IP bans to prevent your bot from collecting more data in an attempt to protect the privacy and integrity of the information.

To avoid these issues, you can:

  • Rotate your IPs.
  • Deploy a consumer-grade VPN service.
  • Include random time gaps and pauses in your scraper code (see the sketch after this list).
  • Remove all query parameters from the URLs.
  • Alternate your scraper headers.
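
Here’s a rough sketch of the random-pause and header-rotation ideas from the list above. The User-Agent strings are abbreviated examples, not a definitive list:

import random
import time
import requests

# A small pool of User-Agent strings to alternate between requests.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36",
]

def polite_get(url):
    # Random pause so requests don't arrive at a fixed, bot-like interval.
    time.sleep(random.uniform(2.0, 6.0))
    headers = {"User-Agent": random.choice(USER_AGENTS)}
    return requests.get(url, headers=headers, timeout=30)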

Amazon uses different structures for its product pages

The variety of products sold on Amazon is vast — to say the least. We’re talking about the world’s biggest online retailer here, so it’s normal for them to implement different site attributes to highlight the key features of what they’re selling. Attempting to scrape data from Amazon might lead you to unknown response errors simply because your scraper might be designed and customized for a particular web structure. When this structure changes, your scraper will most likely fail unless you design it to handle exceptions.

Your scraper’s code needs to be resilient to keep up with the ever-changing page structure patterns. Try including try-except blocks in your code to keep errors and exceptions at bay. The good news is that you’ll be identifying the attributes you’re looking for before you even start coding, so it should be easy to design code that can look for a specific attribute using string matching tools.
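
As a rough illustration using Scrapy-style selectors, a defensive extraction helper might look like the sketch below. The field names and fallback values are just examples:

def extract_review(node):
    """Extract one review defensively: a missing attribute or a changed
    layout yields a placeholder instead of crashing the whole crawl."""
    review = {}
    try:
        review["rating"] = node.css(".review-rating ::text").get(default="N/A")
        review["title"] = node.css(".review-title ::text").get(default="")
        review["text"] = node.css(".review-text ::text").get(default="")
    except Exception as exc:
        # Unexpected structure: log it and move on to the next review.
        print(f"Skipping malformed review block: {exc}")
        return None
    return review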

Your Amazon review scraping methods might be inefficient

If you don’t take care of the efficiency and speed of your algorithm from the very start, you might end up with a scraper that spits out information you don’t need. This will result in hundreds of thousands of rows of useless, irrelevant data.

You can fix this issue by doing a little math with the data you already have. Calculate the number of requests you need to send per second based on the number of products or sellers you’re extracting information about, and design your scraper to meet this condition.

Creating a multi-threaded scraper will allow your CPU to work more efficiently on each response. Remember, Amazon has a lot of information, so a single-threaded, network blocking operation will fail to keep things moving fast enough.
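
A minimal sketch of that idea using Python’s standard-library thread pool — the page range is arbitrary, and in practice you’d combine this with the proxy and timing precautions covered elsewhere in this guide:

from concurrent.futures import ThreadPoolExecutor
import requests

urls = [
    f"https://www.amazon.com/product-reviews/B01IO1VPYG/?pageNumber={i}"
    for i in range(1, 9)
]

def fetch(url):
    r = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}, timeout=30)
    return url, r.status_code

# While one thread waits on the network, the others keep working.
with ThreadPoolExecutor(max_workers=4) as pool:
    for url, status in pool.map(fetch, urls):
        print(status, url)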

Your computer needs a little help

To start scraping Amazon, you’ll need to pull out the big guns. Sites like these handle massive amounts of information, so you must use high-capacity memory resources in order to keep up. You could also benefit from a cloud-based platform that can provide you with the network pipes and cores you’ll need.

If you store all the data on your PC, you’ll be putting an extra burden on your local system resources. To avoid having memory-related issues, transfer your data to a different storage platform. This way you’ll speed up your scraping tasks and prevent any frustrating system crashes amid the scraping process.

You’re not using a database for recording information

The process of collecting high volumes of data requires you to stay organized. Otherwise, you’ll be swimming in an ocean of information and might experience some difficulties when attempting to perform your Amazon review analysis. A good piece of advice is to store all your records in a database table. You can later use these tables to perform basic querying operations, export your data, and more.
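
For instance, Python’s built-in sqlite3 module is enough to get structured storage and basic querying without any extra infrastructure. The schema below is a simplified example:

import sqlite3

conn = sqlite3.connect("amazon_reviews.db")
conn.execute(
    """CREATE TABLE IF NOT EXISTS reviews (
           author  TEXT,
           rating  REAL,
           comment TEXT
       )"""
)

rows = [("Jane D.", 5.0, "Great drone!"), ("John S.", 2.0, "Broke in a week.")]
conn.executemany("INSERT INTO reviews VALUES (?, ?, ?)", rows)
conn.commit()

# Basic querying later on — for example, the average rating so far.
print(conn.execute("SELECT AVG(rating) FROM reviews").fetchone()[0])
conn.close()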

Is scraping Amazon reviews illegal?

In short, no — scraping Amazon reviews is not illegal. However, it’s heavily frowned upon and discouraged by the site’s policy. Amazon does not condone users collecting information in high volumes. And while web scraping is not illegal on its own, certain conditions might not be exactly ethical.

Small- and large-scale companies alike use scraping tools as a way to gather data in an inexpensive, efficient way all the time.

To keep this activity legal they must follow these rules:

  • They should not scrape copyrighted data.
  • The services of the site being scraped should not be overwhelmed by the scraper.
  • The bot should not violate the terms and conditions of the site being scraped.
  • The scraper must not collect sensitive information or any data that violates privacy and security.
  • The data should only be extracted under fair-use standards.

Best practices to avoid violating any rules

There are probably numerous reasons for you to rely on Amazon reviews scraping for your business. We’re not here to judge you — but rather to help you understand how to do this activity in the best way possible so that you can take full advantage of it.

Follow the terms of service

To avoid raising any flags while performing an Amazon reviews analysis, you first need to ensure you’re not breaking the Terms of Service of the site you’re gathering data from (in this case, Amazon). If you encounter a site that clearly prohibits any kind of web scraping or data crawling, run. Do not attempt to pull data from it using automated engines. This applies even when using a rotating residential proxy.

Every site publishes its crawling rules in a robots.txt file in its root directory. You can view it by adding “/robots.txt” to the end of the site’s root URL. This file specifies which parts of the site, if any, are open to crawlers and scrapers.
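
Python’s standard library can even run this check for you. Here’s a small sketch using urllib.robotparser:

from urllib.robotparser import RobotFileParser

rp = RobotFileParser("https://www.amazon.com/robots.txt")
rp.read()

url = "https://www.amazon.com/product-reviews/B01IO1VPYG/"
# False means the site's robots.txt disallows crawling this path.
print(rp.can_fetch("*", url))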

Luckily, it’s rare to find a site that absolutely prohibits crawlers and scrapers. However, if by any chance you do, you can always try and get written permission from them before you even try to scrape their data. Having a document to back you up in case of trouble is a good idea if you fear any legal reprimands.

Time your requests

Always make sure not to load too many requests at once or in a short period of time. Overburdening the website might make it look like you don’t have the best intentions — or worse, make it seem like you’re attacking it. Changing the patterns of your scraping bot can also help you bypass the trend-detection mechanisms Amazon may have in place.

Keep your data close

Once you’ve collected your data, ensure it won’t be copied and distributed. The data should always be for you and your team exclusively if you don’t want to incur copyright infringements. Verify the license of the data or obtain written permission from the copyright holder if you must reproduce or republish the data.

Be ready to take further action by creating a page that extensively justifies the purpose of your research. Having written testimony of what you’re trying to achieve and why you need this data to reach your goals allows you to easily explain yourself in case anything happens.

Top 3 Benefits of Scraping Amazon Reviews


Who wouldn’t love the ability to read their target audience’s minds to know exactly what they want and need? Looking at Amazon reviews is the next best thing — they let you know what people are saying about different products and what they expect from them. You can align what you learn from your research with your own products and strategy to provide a better service to your clientele.

However, manually searching for each product’s reviews is a time-consuming process. There are too many pages — typically thousands — and it’s hard to keep track of them and gather the exact information you’re looking for. That’s when scraping Amazon reviews comes in handy. It will allow you to get your hands on relevant information like:

  • Review date
  • Author name
  • URL
  • Header
  • Rating
  • Detailed reviews

Why scrape Amazon reviews?

Scraping Amazon reviews is a simple way to obtain important details that the official Amazon Product Advertising API leaves out. It allows you to monitor customer opinions on the exact products you sell or are looking to introduce into your store. You can also keep an eye on what your competitors are doing and learn what their clients say about the quality of the items they sell.

Here are the three main benefits of collecting Amazon reviews data:

1. Gathering sentiment analysis material

Sentiment analysis helps identify users’ emotions toward a product. This information helps sellers like you and other potential buyers understand the public’s opinion before making their own decisions about a particular item. This type of analysis can be performed on scraped Amazon reviews.
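
As a hedged illustration, here’s what a first pass at sentiment scoring might look like using NLTK’s VADER analyzer — one popular option among many — on a couple of made-up review snippets:

# pip install nltk
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

nltk.download("vader_lexicon", quiet=True)
sia = SentimentIntensityAnalyzer()

reviews = [
    "Great little drone, my kids love it!",
    "Stopped working after two days. Very disappointed.",
]
for text in reviews:
    scores = sia.polarity_scores(text)
    # compound ranges from -1 (very negative) to +1 (very positive)
    print(f"{scores['compound']:+.2f}  {text}")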

2. Keeping a close eye on online reputation

Larger-scale companies might have a hard time monitoring the reputation of their products online. Scraping reviews reveals relevant data that acts as an input they can easily use to measure their popularity to know where they stand.

3. Monitoring trends

Online retailers like you can use Amazon review data to learn more about product pricing and user tendencies. This information helps businesses further understand customer needs and keep up with them.

Use the Best Proxies To Scrape Amazon Reviews


Regularly performing Amazon review analysis is an incredible tool you can use to improve your service and business strategies. It allows you to better understand your clientele’s wants and needs and to anticipate their demands. Monitoring what your customers are saying about your competitors and the quality of their products will also give you the upper hand when introducing your own.

There are many ways to scrape Amazon reviews. This guide can help you understand some of the most common ones and the best practices to use them so that you can gather the information you need in a quick and simple manner.

Using proxies is an excellent way to avoid getting caught in all your website data scraping endeavors.

