Sales & data science: a step-by-step guide to competitor analysis using Python

Sunscrapers Team

16 March 2020, 11 min read

thumbnail post

What's inside

  1. Tools of the trade
  2. Data mining and preparation
  3. Q1: What are the average prices of our product type?
  4. Q2: How does pricing affect sellability? 
  5. Q3. What are the most popular screen protector brands?
  6. Q4. Which brands are the most profitable? 

What makes sellers successful on online marketplaces like Amazon or eBay? Data science helps to answer this question. 

Online marketplaces like Amazon or eBay are full of sellers competing with each other at every step. It’s not easy to increase sales when other people offer products that are very similar to yours. 

Have you ever wondered what makes buyers choose one seller over another if they all offer pretty much the same thing? 

You might be thinking that it’s all down to the price. 

So did I, but then I couldn’t resist the temptation to see whether there are other factors that come into play here.

In this article, I step into the shoes of an online seller to see how a data-driven approach using Python programming tools can help answer this question.

To do that, I took two steps.

Step 1: Find an online marketplace that is rich in content about sales history. This article was written for educational purposes, so I’d like to keep to name of the marketplace confidential.

Step 2: Come up with a method that allows comparing different offers. The trick here was finding a product that comes in many similar variants and is sold by a large number of sellers. One of the most commonly sold products on the online marketplace I picked for this study is something we all use: 

A smartphone screen protector

In this article, I’m going to ask and answer 4 questions:

Q1. What are the average prices of our product type?

Q2. How does pricing affect sellability?

Q3. What are the most popular screen protector brands?

Q4. Which brands are the most profitable?

Tools of the trade

Python - our programming language of choice for data science. 

Python tools:

  • Scrapy - this web scraping/crawling framework offers convenient features such as marshalling and preprocessing of field values. You can run it via the online scrapinghub platform, which helps to make your crawling process less taxing. 
  • Pandas - I will use it to load data into tables (and then clean, process, and analyze it). 
  • Seaborn and Matplotlib - these are some handy libraries for data visualization in Python. 

I also wrote two articles to help you work with Pandas, check them out:

How to use the Split-Apply-Combine strategy in Pandas groupby

How to handle large datasets in Python with Pandas and Dask

Data mining and preparation

1. Getting the data

The first step is finding a data source. I needed to choose an online marketplace that offers some kind of indicator of sales performance - that’s how I can evaluate it against other offer features. The platform I picked provides information about the last 100 transactions.

Note: The web crawler code content is rather extensive and will vary for different marketplace websites, so I decided not to include examples of the scraping code in this article. The Python framework for scraping and web crawling, Scrapy, offers plenty of documentation with easy-to-follow tutorials, so refer to them if needed.

Here’s a short description of how I got the data for this article:

First, I manually searched for smartphone screen protectors on the online marketplace and started the crawling process from there. Typically, a search pattern starts with some kind of an offer list where each listing points to an item-dedicated page with more information, possibly even a list of past purchases. 

Note that you can usually harvest valuable information about the product’s presentation already in the search results list (like the seller’s status, points, no. of past purchases). After all, this is the customer’s first exposure to the offer and it may impact their decision-making process. 

After the crawl, I ended up with two Pandas tables. The main table (df) had all of the product information, with each row corresponding to an item. The other table (sale_history), stored information about the past purchases, with each product including many rows of individual sale events data. 

I’ll show you the table examples later on. 

2. Processing the data

After the data extraction step, it’s time to do some cleaning and data preparation. Besides all the usual steps (removing nulls, casting columns into the right data types, etc.), there were a couple of interesting steps I’d like to mention here - again, without going into the details. 

As a first step, I tend to go through individual columns with the Pandas unique() method. That’s how I can see whether the values are consistent and sensible - and catch any potential issues. Then I check for data duplication by grouping rows by the column that serves as the unique identifier for a specific item - in this case I used product_id. 

One of the first things I noticed is that some product pages were linked to multiple listings in the search results page (coincidence? I don’t think so!). I got rid of the duplicates, but decided to keep this information for analysis. So I created a new column with the number of listings per item first, then deleted the copies except for one:

df['same_offer_count'] = df.groupby('product_id')['product_id'].transform('count')
df = df.drop_duplicates(subset='product_id', keep='first')

Another interesting problem was dealing with multiple currencies used throughout the marketplace. My raw data table contained bare price string values along with the quoted currency symbol (for example, ‘US $1.09’ or ‘C $2.42’), so I  needed to extract the numeric values and unify all the price currencies by converting them to USD. Here are a few example rows before the transformation:

list of products, models and prices

Here’s the code I used to transform it: 

import re

from currency_converter import CurrencyConverter

cc = CurrencyConverter() currency_shortcuts = {'C':'CAD', 'US':'USD', 'AU':'AUD'} # first I checked only these occur... regx_str=r'(\\w+\\s\*)\\$[ ]?(\\d+[.|,]?\\d+)' # note the two ‘re’ groups! df[['currency', 'quoted_price']] = df['current_price'].str.extract(pat=regx_str) df['currency'] = df['currency'].str.replace(' ', '') df['currency'] = df['currency'].map(currency_shortcuts) df['price_USD'] = df['quoted_price'].copy() for currency in [ c for c in df['currency'].unique() if c not in ['USD']]: fltr = df['currency'].isin([currency]) df.loc[fltr, 'price_USD'] = df.loc[fltr, 'quoted_price']\\ .apply(lambda x: cc.convert(x, currency, 'USD')) 

Which resulted in: 

list with products and prices after converting the currency

Next, I processed the sales history table (sale_history). I performed some basic type fixes, extracted and converted prices and currencies, and filled in the null values (code not shown). I ended up with this table (again, it’s just a snapshot of rows):

sales history table

To make it useful for my analysis and plotting, I aggregated the entries by date (and product_id, of course) and calculated the number of sold items and the daily selling rates. Wrapping all this into a single function allowed applying it row-wise to the data frame:

def calculate_sale_history_stats(df):
    “””Calculates statistics on sale history, returns new dataframe”””    
    delta = df['purchase_date'].max() - df['purchase_date'].min()
    days = int(delta.days)
    values = list(df['quantity_sold'])
    earnings = list(df['total_price'])
    sold_count = sum(values)
    if len(values) < days:
        values.extend([0]*(len(values) - days))
        earnings.extend([0]*(len(earnings) - days))
    res = pd.Series(        [sold_count, np.mean(values), np.std(values), np.mean(earnings), np.std(earnings)], 
'Mean_daily_sold_count', 'Sold_count_St.Dev',                
'Daily_earnings', 'Daily_earnings_St.Dev']    )  
    return round(res, 2)

And applying it to the sale_history dataframe: 

sale_history_stats = sale_history.groupby('brand').apply(calculate_sale_history_stats)

resulted in: 

Finally, I merged the aggregated sales stats (sale_history_stats) into the main df table:

df = pd.merge(
    right=df[['product_id','shipping_cost', 'shipping_from', 'top_rating_badge',
              'seller_feedback_score', 'seller_feedback_perc',]]

Here’s the resulting df table (again, only a selection of columns is shown): 

Now we’re good to go. So let’s begin our competitor analysis.

Q1: What are the average prices of our product type?

Let’s see how much profit we can make from screen protectors in general. How much do sellers generally charge for this type of product? 

I can analyze the prices in the marketplace and see what customers usually pay for smartphone screen protectors like this:

import matplotlib.pyplot as plt import seaborn as sns
prices = df['price_USD']

sns.distplot(df['price_USD'], ax=ax, bins=200)

paid_prices = sale_history['sell_price']

sns.distplot(paid_prices, ax=ax, bins=100)

And here are the two resulting histograms with overlaid additional information (code not shown):

graph showing prices paid for screen protectors

graph showing prices of available offers

As you can see, most screen protectors cost about $1.15 (average ~$3.9). However, it seems that customers often prefer to chip in a few extra bucks to their purchase (average of ~$5, median $~3.8). ‘The cheaper, the better’ rule doesn’t apply here.

Based on this insight, we may assume that choosing to price our product around $4 will do the job. 

Q2: How does pricing affect sellability? 

Pricing is probably the most important factor in the customer’s decision-making process. Sellers often assume that a high price may discourage consumers from buying their products. Striking the right balance between profitability and affordability can become a challenge.

Let’s check how the number of sold items and daily earnings match the unit price (as a daily average):

# The daily earnings vs price:
sns.lmplot(x='sell_price', y='Daily_earnings', 
           data=df[['sell_price', 'Daily_earnings']])

# Plot the sales frequency vs price:
sns.lmplot(x='sell_price', y='Mean_daily_sold_count', 
           data=df[['sell_price', 'Mean_daily_sold_count']],)

graph showing average daily number of sold items vs price

graph showing average daily earnings vs price

As expected, higher prices mean fewer sales on average. But when we look at the daily income, it seems that profits tend to increase with higher prices. 

At this point, it would be interesting to find out whether the pricing reflects the quality and/or reputation of the product. Unfortunately, that’s beyond the scope of this study.

Let’s take a closer look at brand names. There’s a spectrum of values referring to “no brand.”

So let’s clean this mess up first and label them all as “unbranded”: 

df['brand'] = df['brand'].apply(
     lambda s: 'Unbranded' if s in          
['Does not apply', 'Does Not Apply', 'Unbranded/Generic', ‘unbranded']   else s)

graph showing most commonly offered brands

Now I can put data into work and get a pie chart showing brand names. It indicates that the majority of the products offered on our online marketplace (c. 60 percent) has no branding (or don't indicate one) at all.

Consumers may want to stick to a recognizable name, so let’s ignore the unnamed products for a moment and focus instead on the top 20 brands offered in the marketplace that sell the highest number of products daily.  

For this, I will use our sale_history table with all the transactions data. 

Let’s create a table with information about the brands offered on the marketplace: 

sold_brands = sale_history.groupby('brand').apply(calculate_sale_history_stats)

Next, let’s see the top 10 brands that have recorded the highest number of sold items so far - create a table and plot it like this: 

top_sold_brands = sold_brands.sort_values(
    by=['Sold_count', 'Daily_earnings', 'Mean_daily_sold_count'], 

sns.barplot(data=top_sold_brands.iloc[1:21], x='brand', y='Sold_count')

table showing top brands with highest number of sold items

graph showing top brands with highest number of sold items

One glance is enough to see that all the unnamed brands combined have accumulated the highest number of items sold. However, Spigen appears to be the runner-up in this category and dominates the market among named brand products.

Q4. Which brands are the most profitable? 

Which screen protector brands brought sellers the highest revenue in the shortest time? Let’s bring the unbranded products back to the table as the picture might turn out slightly different when it comes to earnings:

most_profitable_brands = sold_brands.sort_values(    
by=['Daily_earnings', 'Mean_daily_sold_count', 'Sold_count'],    
most_profitable_brands = most_profitable_brands[[    
'brand', 'Daily_earnings', 'Daily_earnings_St.Dev','Sold_count',     'Mean_daily_sold_count', 'Mean_Sold_count_St.Dev']]

I get this table:

And let’s visualize it on a bar chart like so:, y, width=0.85, yerr=y_err, alpha=0.7, color='darkgrey', ecolor='black')

That should create the following graph:

graph showing brands with highest daily profit

It’s clear now that unbranded products aren’t that profitable. Note that the order of brands changed a lot when we compare revenue instead of the total number of sold items. Brands with mid-range prices are at the top of the list now (like PureGear that costs about $9.5). And that’s despite their relatively infrequent daily rate of sales (c. 1-2 daily). 

Answering these 4 questions showed us that ‘quality over quantity’ is probably the smartest way of crafting a sales strategy for online marketplaces. 


In this article, I focused on taking you through the data mining and preparation process to finally answer four key questions about sales trends on the online marketplace I chose for this study.

Here are some other questions I could answer from the data:

  • Do shipping costs impact customer decisions?
  • How many sellers discount their products?
  • Do discounts result in higher sales?
  • Does it make any difference to customers where they purchase products from?
  • How does the seller feedback score affect sales?
  • Does having a Top Rated badge boost product sales?

Luckily, I’ve already done that. 

You can get all the answers in my study How to increase sales on online marketplaces with data science. Download it here for free.

how to increases sales on online marketplaces with data science ebook

Sunscrapers Team

Sunscrapers empowers visionary leaders to ride the wave of the digital transformation with solutions that generate tangible business results. Thanks to agile and lean startup methods, we deliver high-quality software at top speed and efficiency.


business intelligence


Recent posts

See all blog posts

Are you ready for your next project?

Whether you need a full product, consulting, tech investment or an extended team, our experts will help you find the best solutions.