This table contains the sales made by each customer in each country, along with the selling price (Cost) and the date.
The PySpark code below reads this CSV file from DBFS into a Spark DataFrame in an Azure Databricks notebook.
# import libraries
import pyspark
from pyspark.sql import Window, functions as F

SourcePath = 'path'  # this will contain the path of your file
Sales = spark.read.format('csv').options(header='true', inferSchema='true').load(SourcePath)
display(Sales)  # display the DataFrame to see the data
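Because inferSchema='true' asks Spark to guess each column's type, it is worth confirming the result before aggregating. A minimal check is shown below; the column names CustomerName, CountryCode, Cost, and Date are assumptions based on the table description above, not confirmed output.

# Verify what Spark inferred before aggregating.
# Expected columns (assumed from the table description above):
# CustomerName, CountryCode, Cost, Date
Sales.printSchema()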
The business wants to analyze sales across regions, and also wants to see the performance of each customer (business partner) who helps drive those sales. Customers who bring in more business will be awarded an incentive to maintain a healthy relationship.
Sales by Region
Fetching the sales per region (CountryCode) with the total sales Cost can be achieved with the groupBy function and the aggregate sum function from the pyspark.sql.functions module. The column on which the grouping is done is passed as a parameter to groupBy(); it can also take multiple column names as parameters.
SalesByRegion = Sales.groupBy("CountryCode").agg(F.sum("Cost").alias("SalesAmount"))
display(SalesByRegion)
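To see the best-performing regions first, the aggregated result can be sorted by the total amount. This is a minimal sketch, not part of the original analysis; TopRegions is a hypothetical name.

# Sort regions by total sales, highest first, so the top region appears at the top.
TopRegions = SalesByRegion.orderBy(F.desc("SalesAmount"))
display(TopRegions)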
Sales by Region & Customer
The code below gets the sales done by each customer across regions. With this, the business can gain insight into which customers are doing the most business.
# Analyzing data with aggregations on a DataFrame.
# groupBy allows you to group rows together based on some column value;
# for example, you could group purchase data by Country Code and Customer Name.
SalesByRegionCustomer = Sales.groupBy("CustomerName", "CountryCode").agg(F.sum("Cost").alias("Total"))
display(SalesByRegionCustomer)
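Since the goal is to award an incentive to the best-performing customer, the Window import from the setup code can be used to rank customers within each region. The sketch below is one possible approach, not the original analysis; the use of dense_rank, the filter to rank 1, and the names RegionWindow and TopCustomerPerRegion are assumptions for illustration.

# Rank customers by total sales within each region; rank 1 is the top performer.
# (Assumed approach: dense_rank over a per-region window, keeping only rank 1.)
RegionWindow = Window.partitionBy("CountryCode").orderBy(F.desc("Total"))
TopCustomerPerRegion = (
    SalesByRegionCustomer
    .withColumn("Rank", F.dense_rank().over(RegionWindow))
    .filter(F.col("Rank") == 1)  # keep only the best customer per region
)
display(TopCustomerPerRegion)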