Consumer Shopping Data Analysis Based on a Retail Dataset

大鹏bmfm · Published 2024-12-31 16:22:47

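About the data

To get the dataset, leave a request in the comments.

This dataset provides a comprehensive view of consumer shopping trends and is intended to reveal patterns and behavior in retail purchases. It contains detailed transaction data across product categories, customer demographics, and purchase channels. Its main features may include:

Transaction details: purchase date, transaction value, product category, and payment method.
Customer information: age group, gender, location, and loyalty status.
Shopping behavior: purchase frequency, average spend per transaction, and seasonal trends.

The dataset is well suited to data scientists, analysts, and marketers who want to:

Analyze consumer purchasing patterns over time.
Identify popular product categories and high-performing market segments.
Develop customer segmentation and personalization strategies.
Build predictive models for sales forecasting or customer retention.

Whether you are performing exploratory data analysis, creating visualizations, or training machine learning models, this dataset can provide valuable insights to support data-driven decisions in retail.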

Data preprocessing

In [29]:

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import warnings

warnings.filterwarnings('ignore')

In [30]:

df = pd.read_csv("./shopping_trends.csv")

In [31]:

df.head()

	Customer ID	Age	Gender	Item Purchased	Category	Purchase Amount (USD)	Location	Size	Color	Season	Review Rating	Subscription Status	Payment Method	Shipping Type	Discount Applied	Promo Code Used	Previous Purchases	Preferred Payment Method	Frequency of Purchases
0	1	55	Male	Blouse	Clothing	53	Kentucky	L	Gray	Winter	3.1	Yes	Credit Card	Express	Yes	Yes	14	Venmo	Fortnightly
1	2	19	Male	Sweater	Clothing	64	Maine	L	Maroon	Winter	3.1	Yes	Bank Transfer	Express	Yes	Yes	2	Cash	Fortnightly
2	3	50	Male	Jeans	Clothing	73	Massachusetts	S	Maroon	Spring	3.1	Yes	Cash	Free Shipping	Yes	Yes	23	Credit Card	Weekly
3	4	21	Male	Sandals	Footwear	90	Rhode Island	M	Maroon	Spring	3.5	Yes	PayPal	Next Day Air	Yes	Yes	49	PayPal	Weekly
4	5	45	Male	Blouse	Clothing	49	Oregon	M	Turquoise	Spring	2.7	Yes	Cash	Free Shipping	Yes	Yes	31	PayPal	Annually

In [32]:

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3900 entries, 0 to 3899
Data columns (total 19 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Customer ID               3900 non-null   int64  
 1   Age                       3900 non-null   int64  
 2   Gender                    3900 non-null   object 
 3   Item Purchased            3900 non-null   object 
 4   Category                  3900 non-null   object 
 5   Purchase Amount (USD)     3900 non-null   int64  
 6   Location                  3900 non-null   object 
 7   Size                      3900 non-null   object 
 8   Color                     3900 non-null   object 
 9   Season                    3900 non-null   object 
 10  Review Rating             3900 non-null   float64
 11  Subscription Status       3900 non-null   object 
 12  Payment Method            3900 non-null   object 
 13  Shipping Type             3900 non-null   object 
 14  Discount Applied          3900 non-null   object 
 15  Promo Code Used           3900 non-null   object 
 16  Previous Purchases        3900 non-null   int64  
 17  Preferred Payment Method  3900 non-null   object 
 18  Frequency of Purchases    3900 non-null   object 
dtypes: float64(1), int64(4), object(14)
memory usage: 579.0+ KB

All column dtypes are appropriate: the numeric columns use numeric dtypes (int64 or float64) and the remaining columns are object.

In [33]:

df.isnull().sum()

Customer ID                 0
Age                         0
Gender                      0
Item Purchased              0
Category                    0
Purchase Amount (USD)       0
Location                    0
Size                        0
Color                       0
Season                      0
Review Rating               0
Subscription Status         0
Payment Method              0
Shipping Type               0
Discount Applied            0
Promo Code Used             0
Previous Purchases          0
Preferred Payment Method    0
Frequency of Purchases      0
dtype: int64

There are no missing values.

Data analysis

1. Basic statistical analysis

(1) Descriptive statistics

In [34]:

df.describe().drop('Customer ID', axis=1)

	Age	Purchase Amount (USD)	Review Rating	Previous Purchases
count	3900.000000	3900.000000	3900.000000	3900.000000
mean	44.068462	59.764359	3.749949	25.351538
std	15.207589	23.685392	0.716223	14.447125
min	18.000000	20.000000	2.500000	1.000000
25%	31.000000	39.000000	3.100000	13.000000
50%	44.000000	60.000000	3.700000	25.000000
75%	57.000000	81.000000	4.400000	38.000000
max	70.000000	100.000000	5.000000	50.000000

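Descriptive analysis of the numeric columns other than 'Customer ID':

Age: customers range from 18 to 70. The middle 50% are between 31 and 57, and the mean and median are both about 44, so middle-aged customers form the core of the customer base.
Purchase amount: amounts range from 20 to 100 USD with a standard deviation of about 23.7 USD. This wide spread may relate to the type of item purchased, the quantity, or promotions.
Review rating: ratings range from 2.5 to 5.0 with a mean of about 3.75. The middle 50% fall between 3.1 and 4.4, which suggests reasonably good customer satisfaction.
Previous purchases: the number of previous purchases varies widely (1 to 50). The median and mean are both about 25, so a large share of customers are repeat buyers.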
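The 25% and 75% rows of the describe() output bound the middle half of each column. A minimal sketch of reading that range directly, using a small hypothetical stand-in table rather than shopping_trends.csv (the helper name middle_half is an assumption for illustration):

```python
import pandas as pd

# Hypothetical stand-in for two numeric columns of the dataset
sample = pd.DataFrame({
    "Age": [18, 31, 44, 57, 70],
    "Review Rating": [2.5, 3.1, 3.7, 4.4, 5.0],
})

def middle_half(frame: pd.DataFrame) -> pd.DataFrame:
    """Return Q1, Q3, and the interquartile range for each numeric column."""
    q1 = frame.quantile(0.25)
    q3 = frame.quantile(0.75)
    return pd.DataFrame({"q1": q1, "q3": q3, "iqr": q3 - q1})

summary = middle_half(sample)
print(summary)
```

On the real data the same call would be middle_half(df[numeric_columns]), where numeric_columns is whatever list of numeric column names you choose.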

In [35]:

df

	Customer ID	Age	Gender	Item Purchased	Category	Purchase Amount (USD)	Location	Size	Color	Season	Review Rating	Subscription Status	Payment Method	Shipping Type	Discount Applied	Promo Code Used	Previous Purchases	Preferred Payment Method	Frequency of Purchases
0	1	55	Male	Blouse	Clothing	53	Kentucky	L	Gray	Winter	3.1	Yes	Credit Card	Express	Yes	Yes	14	Venmo	Fortnightly
1	2	19	Male	Sweater	Clothing	64	Maine	L	Maroon	Winter	3.1	Yes	Bank Transfer	Express	Yes	Yes	2	Cash	Fortnightly
2	3	50	Male	Jeans	Clothing	73	Massachusetts	S	Maroon	Spring	3.1	Yes	Cash	Free Shipping	Yes	Yes	23	Credit Card	Weekly
3	4	21	Male	Sandals	Footwear	90	Rhode Island	M	Maroon	Spring	3.5	Yes	PayPal	Next Day Air	Yes	Yes	49	PayPal	Weekly
4	5	45	Male	Blouse	Clothing	49	Oregon	M	Turquoise	Spring	2.7	Yes	Cash	Free Shipping	Yes	Yes	31	PayPal	Annually
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
3895	3896	40	Female	Hoodie	Clothing	28	Virginia	L	Turquoise	Summer	4.2	No	Cash	2-Day Shipping	No	No	32	Venmo	Weekly
3896	3897	52	Female	Backpack	Accessories	49	Iowa	L	White	Spring	4.5	No	PayPal	Store Pickup	No	No	41	Bank Transfer	Bi-Weekly
3897	3898	46	Female	Belt	Accessories	33	New Jersey	L	Green	Spring	2.9	No	Credit Card	Standard	No	No	24	Venmo	Quarterly
3898	3899	44	Female	Shoes	Footwear	77	Minnesota	S	Brown	Summer	3.8	No	PayPal	Express	No	No	24	Venmo	Weekly
3899	3900	52	Female	Handbag	Accessories	81	California	M	Beige	Spring	3.1	No	Bank Transfer	Store Pickup	No	No	33	Venmo	Quarterly

3900 rows × 19 columns

(2) Correlation analysis of the numeric columns

In [36]:

# Correlation between the numeric columns
corr = df[['Age', 'Purchase Amount (USD)','Review Rating', 'Previous Purchases']].corr()

plt.figure(figsize=(8, 6))
sns.heatmap(corr, annot=True, cmap='Purples', fmt='.2f', linewidths=0.5)
plt.title('Correlation Heatmap')
plt.show()

The heatmap shows no obvious correlation among the numeric columns: Age, Purchase Amount (USD), Review Rating, and Previous Purchases show no clear pairwise relationship.
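A heatmap is read by eye, so it can help to pull out the strongest off-diagonal correlation as a number. The sketch below does this on hypothetical random data (four independent columns that only borrow the dataset's column names), not on the real file:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Hypothetical stand-in data: four independent numeric columns
sample = pd.DataFrame(
    rng.normal(size=(500, 4)),
    columns=["Age", "Purchase Amount (USD)", "Review Rating", "Previous Purchases"],
)

corr = sample.corr()
# Keep only the upper triangle; k=1 also drops the diagonal of 1.0 values
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack().dropna()

# Column pair with the largest absolute correlation
strongest = pairs.abs().idxmax()
print(strongest, pairs[strongest])
```

If the largest absolute value is close to zero, the "no obvious correlation" reading of the heatmap is supported.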

2. Product categories

(1) Purchase amount by item

In [37]:

plt.figure(figsize=(18, 9))
sns.barplot(x='Item Purchased', y='Purchase Amount (USD)', data=df, palette='coolwarm')
plt.title('Purchase Amount by Item Purchased')
plt.show()

(2) Purchase amount by category: box plot

In [38]:

sns.catplot(data=df, x='Category', y='Purchase Amount (USD)', kind='box', height=6, aspect=2)
plt.title("Purchase Amount Distribution by Category")
plt.show()

(3) Purchase amount by category: violin plot

In [39]:

plt.figure(figsize=(10, 6))
sns.violinplot(x='Category', y='Purchase Amount (USD)', data=df, inner='quart')
plt.title("Violin Plot: Purchase Amount by Category")
plt.show()

(4) Purchase count by category

In [40]:

plt.figure(figsize=(10, 6))
sns.countplot(x='Category', data=df)
plt.title('Count of Items Purchased by Category')
plt.show()

(5) Share of purchases by category

In [41]:

plt.figure(figsize=(6, 4))
counts = df['Category'].value_counts()
explode = (0, 0.1, 0.2, 0.3) 

colors = ['#A85CF9', '#FF4949', '#BDF2D5', '#FF06B7', '#4B7BE5', '#FF5D5D', '#FAC213', '#37E2D5', '#6D8B74', '#E9D5CA']

counts.plot(kind='pie', fontsize=12, colors=colors, explode=explode, autopct='%1.1f%%')
plt.axis('equal')
plt.legend(labels=counts.index, loc='best')
plt.show()

(6) Purchase count by item

In [42]:

def barw(ax):
    """Annotate each horizontal bar with its count at the right end of the bar."""
    for p in ax.patches:
        val = p.get_width()
        x = p.get_x() + p.get_width()
        y = p.get_y() + p.get_height() / 2
        ax.annotate(int(val), (x, y))


plt.figure(figsize=(16, 9))
# Count how many times each item was purchased
item_counts = df['Item Purchased'].value_counts()
# Build one color per item
colors = sns.color_palette("hls", len(item_counts))
ax0 = sns.countplot(data=df, y='Item Purchased', order=item_counts.index, palette=colors)

# Mean number of purchases per item
mean_count = item_counts.mean()
# Draw the mean as a red dashed line
line = ax0.axvline(mean_count, color='r', linestyle='--')
barw(ax0)
# Add a legend for the mean line
ax0.legend([line], [f'avg_count={mean_count:.2f}'])

plt.show()

In [43]:

# Keep the items purchased more often than the mean, sorted by count in descending order
above_mean_items = item_counts[item_counts > mean_count].sort_values(ascending=False).reset_index()
above_mean_items.columns = ['Item Purchased', 'Purchase Times']
# Format each row as "Item(count)"
result = above_mean_items.apply(lambda x: f"{x['Item Purchased']}({x['Purchase Times']})", axis=1)
result.tolist()

['Blouse(171)',
 'Pants(171)',
 'Jewelry(171)',
 'Shirt(169)',
 'Dress(166)',
 'Sweater(164)',
 'Jacket(163)',
 'Coat(161)',
 'Sunglasses(161)',
 'Belt(161)',
 'Sandals(160)',
 'Socks(159)',
 'Skirt(158)',
 'Scarf(157)',
 'Shorts(157)']

3. Location

In [44]:

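# Count how often each Location value occurs
location_counts = df['Location'].value_counts()

# Take the ten most common locations and their counts
top_10_locations = location_counts[:10]

# Share of each location within the top ten (not within all 3,900 rows)
total_count = top_10_locations.sum()
ratios = top_10_locations / total_count

# Build a DataFrame of locations and their shares
pd.DataFrame({'Location': top_10_locations.index, 'Ratio': ratios})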

	Location	Ratio
Location
Montana	Montana	0.106667
California	California	0.105556
Idaho	Idaho	0.103333
Illinois	Illinois	0.102222
Alabama	Alabama	0.098889
Minnesota	Minnesota	0.097778
New York	New York	0.096667
Nevada	Nevada	0.096667
Nebraska	Nebraska	0.096667
Delaware	Delaware	0.095556

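The shares of the ten most common locations are very close to one another, ranging from about 9.6% to 10.7% of the top-ten total.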
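The Ratio column above divides by the total of the top ten locations only, so it sums to 1 across those ten rows. If the share of all customers is wanted instead, value_counts(normalize=True) gives it directly. A minimal sketch on a hypothetical stand-in for df['Location']:

```python
import pandas as pd

# Hypothetical stand-in for df['Location']
locations = pd.Series(
    ["Montana", "Montana", "Idaho", "Idaho", "Idaho",
     "Ohio", "Texas", "Texas", "Utah", "Iowa"],
    name="Location",
)

# normalize=True divides each count by the total number of rows
share_of_all = locations.value_counts(normalize=True)

# Keep the three most common locations as a two-column table
top_3 = (
    share_of_all.head(3)
    .rename_axis("Location")
    .rename("Share")
    .reset_index()
)
print(top_3)
```

Using rename_axis and rename before reset_index also avoids the duplicated Location index and column seen in the table above.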

In [45]:

my_circle = plt.Circle((0, 0), 0.9, color='white')

plt.pie(df['Location'].value_counts()[:10].values, 
        labels=df['Location'].value_counts()[:10].index)

p = plt.gcf()
p.gca().add_artist(my_circle)
plt.show()

4. Comparison by gender

(1) Purchase amount by gender

In [46]:

plt.figure(figsize=(11, 5))
plt.gcf().text(0.55, 0.95, "Boxen Plot", fontsize=40, color='Red', ha='center', va='center')

sns.boxenplot(x=df['Gender'], y=df['Purchase Amount (USD)'], palette="Set1")

plt.show()

(2) Distribution of the numeric columns by gender

In [47]:

import math
# Count the numeric columns (int64 or float64)
count = sum(1 for col in df.columns if df[col].dtype in ['int64', 'float64'])
# Work out the subplot grid: number of columns and rows
cols = math.ceil(math.sqrt(count))
rows = math.ceil(count / cols)

plt.figure(figsize=(20, 12))

i = 1
for column in df.columns:
    if df[column].dtype in ['int64', 'float64']:
        plt.subplot(rows, cols, i)
        df[df['Gender'] == 'Male'][column].hist(bins=35, color='blue', label='Male', alpha=0.9)
        df[df['Gender'] == 'Female'][column].hist(bins=35, color='red', label='Female', alpha=0.5)

        plt.legend()
        plt.xlabel(column)
        i += 1

plt.tight_layout()
plt.show()

(3) Purchase frequency by gender and payment method

In [49]:

cat = ['Gender', 'Payment Method']

fig, ax = plt.subplots(1, 2, figsize=(16, 8))

for column, axes in zip(cat, ax.flatten()):
    sns.countplot(ax=axes, x=df[column], hue=df['Frequency of Purchases'], palette='magma', alpha=0.8)
    axes.set_title(f'Count of {column} by Frequency of Purchases')

if len(cat) < len(ax.flatten()):
    [axes.set_visible(False) for axes in ax.flatten()[len(cat):]]

plt.tight_layout()
plt.show()

In [50]:

cat = ['Gender', 'Payment Method']

fig, ax = plt.subplots(1, 2, figsize=(16, 8))

for indx, (column, axes) in enumerate(zip(cat, ax.flatten())):
    if column == 'Gender':
        # Group by gender and compute the proportion of each purchase frequency
        gender_counts = df.groupby(['Gender', 'Frequency of Purchases']).size().reset_index(name='count')
        total_per_gender = df.groupby('Gender').size().reset_index(name='total')
        gender_merged = gender_counts.merge(total_per_gender, on='Gender')
        gender_merged['frequency'] = gender_merged['count'] / gender_merged['total']

        sns.barplot(ax=axes, x='Gender', y='frequency', hue='Frequency of Purchases', data=gender_merged, palette='magma', alpha=0.8)
        axes.set_title(f'Frequency of Purchases by {column}')
        axes.set_ylabel('Frequency')
    else:
        sns.countplot(ax=axes, x=df[column], hue=df['Frequency of Purchases'], palette='magma', alpha=0.8)
        axes.set_title(f'Count of {column} by Frequency of Purchases')
    # Place the legend in the lower right corner
    axes.legend(loc='lower right')

if len(cat) < len(ax.flatten()):
    [axes.set_visible(False) for axes in ax.flatten()[len(cat):]]

plt.tight_layout()
plt.show()
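The groupby, merge, and divide steps used for the Gender panel compute, for each gender, the proportion of customers in each purchase-frequency class. As a sketch of a more compact route to the same kind of table, pd.crosstab with normalize="index" does the row-wise normalization in one call. The rows below are hypothetical stand-in data:

```python
import pandas as pd

# Hypothetical stand-in rows, not the real dataset
sample = pd.DataFrame({
    "Gender": ["Male", "Male", "Male", "Male", "Female", "Female"],
    "Frequency of Purchases": ["Weekly", "Weekly", "Monthly", "Annually", "Weekly", "Monthly"],
})

# normalize="index" divides each row by its row total, so each gender sums to 1
proportions = pd.crosstab(
    sample["Gender"], sample["Frequency of Purchases"], normalize="index"
)

# Long format with one row per (gender, frequency) pair, ready for sns.barplot
long_form = proportions.reset_index().melt(
    id_vars="Gender", var_name="Frequency of Purchases", value_name="frequency"
)
print(proportions)
```

Plotting proportions rather than raw counts matters here because the two gender groups need not be the same size.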

(4) Purchase amount by gender and season

In [51]:

sns.catplot(x="Gender", y="Purchase Amount (USD)", col="Season",
            kind="boxen", palette="Set2", height=5, aspect=1, data=df, col_wrap=2)
plt.show()

(5) Purchase amount by gender and category

In [52]:

plt.figure(figsize=(16, 9))
params = dict(data=df, x='Category', y='Purchase Amount (USD)', hue='Gender', dodge=True)

# Strip plot of the individual purchases
sns.stripplot(**params, size=8, jitter=0.35, palette=['#33FF66', '#FF6600'], edgecolor='black', linewidth=1)
# Box plot overlaid on the points
sns.boxplot(**params, palette=['#BDBDBD', '#E0E0E0'], linewidth=6)

plt.show()

(6) Age distribution by gender

In [53]:

y = df['Gender']

plt.figure(figsize=(10, 6))
# fill=True replaces the deprecated shade=True (seaborn 0.11 and later)
g = sns.kdeplot(df["Age"][(y == 'Male') & (df["Age"].notnull())], color="Red", fill=True)
g = sns.kdeplot(df["Age"][(y == 'Female') & (df["Age"].notnull())], ax=g, color="Blue", fill=True)
g.set_xlabel("Age")
# A KDE curve shows an estimated density, not a count
g.set_ylabel("Density")
g = g.legend(["Male", "Female"])

plt.show()

5. User profile analysis

In [54]:

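# Age-bracket function: maps an age to a life stage for the grouping below
def categorize_age(age):
    if age < 25:
        return '青年'  # youth
    elif age < 45:
        return '中年'  # middle-aged
    return '老年'  # senior (45 and older)

# Add the age-bracket column to the DataFrame
df['Age_Group'] = df['Age'].apply(categorize_age)

# Spending-bracket function: a simple low/high split that can be refined as needed
def categorize_amount(amount):
    if amount < 50:
        return '低消费'  # low spending
    return '高消费'  # high spending

# Add the spending-bracket column to the DataFrame
df['Amount_Group'] = df['Purchase Amount (USD)'].apply(categorize_amount)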
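The thresholds in categorize_age (below 25, below 45, otherwise) can also be expressed with pd.cut, which bins a whole column at once. This is a sketch of that alternative, not the notebook's method. The English labels and the sample ages are assumptions for illustration:

```python
import pandas as pd

# Hypothetical ages chosen to sit on both sides of each threshold
ages = pd.Series([18, 24, 25, 44, 45, 70], name="Age")

# right=False makes each bin closed on the left: [0, 25), [25, 45), [45, inf)
age_group = pd.cut(
    ages,
    bins=[0, 25, 45, float("inf")],
    labels=["youth", "middle-aged", "senior"],
    right=False,
)
print(age_group.tolist())
```

With right=False the boundary ages 25 and 45 fall into the upper bracket, which matches the strict "<" comparisons in categorize_age.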

In [55]:

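# Multi-dimensional cross analysis, using age bracket by spending bracket as an example
# x.mode()[0] picks the most frequent value within each group
cross_analysis = df.groupby(['Age_Group', 'Amount_Group']).agg({
    'Frequency of Purchases': lambda x: x.mode()[0],
    'Payment Method': lambda x: x.mode()[0],
    'Item Purchased': lambda x: x.mode()[0]
}).reset_index()

print("User profile by age bracket and spending bracket:")
cross_analysis

User profile by age bracket and spending bracket: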

	Age_Group	Amount_Group	Frequency of Purchases	Payment Method	Item Purchased
0	中年	低消费	Weekly	Debit Card	Sandals
1	中年	高消费	Monthly	Credit Card	Backpack
2	老年	低消费	Annually	Cash	Coat
3	老年	高消费	Quarterly	Venmo	Dress
4	青年	低消费	Every 3 Months	Bank Transfer	Pants
5	青年	高消费	Annually	Bank Transfer	Dress

Random forest model training

Dataset split

In [56]:

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, accuracy_score, confusion_matrix

In [57]:

categorical_cols = ['Gender', 'Item Purchased', 'Category', 'Location', 'Size', 'Color', 
                    'Season', 'Subscription Status', 'Payment Method', 'Shipping Type', 
                    'Promo Code Used', 'Preferred Payment Method', 'Frequency of Purchases']

encoder = LabelEncoder()

for col in categorical_cols:
    df[col] = encoder.fit_transform(df[col])

df

	Customer ID	Age	Gender	Item Purchased	Category	Purchase Amount (USD)	Location	Size	Color	Season	...	Subscription Status	Payment Method	Shipping Type	Discount Applied	Promo Code Used	Previous Purchases	Preferred Payment Method	Frequency of Purchases	Age_Group	Amount_Group
0	1	55	1	2	1	53	16	0	7	3	...	1	2	1	Yes	1	14	5	3	老年	高消费
1	2	19	1	23	1	64	18	0	12	3	...	1	0	1	Yes	1	2	1	3	青年	高消费
2	3	50	1	11	1	73	20	2	12	1	...	1	1	2	Yes	1	23	2	6	老年	高消费
3	4	21	1	14	2	90	38	1	12	1	...	1	4	3	Yes	1	49	4	6	青年	高消费
4	5	45	1	2	1	49	36	1	21	1	...	1	1	2	Yes	1	31	4	0	老年	低消费
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
3895	3896	40	0	9	1	28	45	0	21	2	...	0	1	0	No	0	32	5	6	中年	低消费
3896	3897	52	0	0	0	49	14	0	23	1	...	0	4	5	No	0	41	0	1	老年	低消费
3897	3898	46	0	1	0	33	29	0	8	1	...	0	2	4	No	0	24	5	5	老年	低消费
3898	3899	44	0	17	2	77	22	2	3	2	...	0	4	1	No	0	24	5	6	中年	高消费
3899	3900	52	0	7	0	81	4	1	0	1	...	0	0	5	No	0	33	5	5	老年	高消费

3900 rows × 21 columns

In [58]:

# Features (X) and label (y)
X = df.drop(columns=['Customer ID', 'Subscription Status'])  # drop the ID and the label
y = df['Subscription Status']  # label

In [59]:

numerical_cols = ['Age', 'Purchase Amount (USD)', 'Review Rating', 'Previous Purchases']

scaler = StandardScaler()
X[numerical_cols] = scaler.fit_transform(X[numerical_cols])

X.head()

	Age	Gender	Item Purchased	Category	Purchase Amount (USD)	Location	Size	Color	Season	Review Rating	Payment Method	Shipping Type	Discount Applied	Promo Code Used	Previous Purchases	Preferred Payment Method	Frequency of Purchases	Age_Group	Amount_Group
0	0.718913	1	2	1	-0.285629	16	0	7	3	-0.907584	2	1	Yes	1	-0.785831	5	3	老年	高消费
1	-1.648629	1	23	1	0.178852	18	0	12	3	-0.907584	0	1	Yes	1	-1.616552	1	3	青年	高消费
2	0.390088	1	11	1	0.558882	20	2	12	1	-0.907584	1	2	Yes	1	-0.162789	2	6	老年	高消费
3	-1.517099	1	14	2	1.276716	38	1	12	1	-0.349027	4	3	Yes	1	1.637107	4	6	青年	高消费
4	0.061263	1	2	1	-0.454531	36	1	21	1	-1.466141	1	2	Yes	1	0.391025	4	0	老年	低消费

In [60]:

# Label-encode the remaining object-type columns
for col in X.select_dtypes(include='object').columns:
    X[col] = encoder.fit_transform(X[col])

In [61]:

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
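In the cells above, the scaler and the encoders are fitted on the full table before the split. One alternative, sketched here on hypothetical toy data, is to wrap the preprocessing and the model in a scikit-learn Pipeline so that they are fitted on the training rows only. The column choice, the toy target, and the use of OrdinalEncoder in place of LabelEncoder are all assumptions for illustration:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OrdinalEncoder, StandardScaler

rng = np.random.default_rng(42)
n = 200
# Hypothetical stand-in data with one numeric and one categorical feature
toy = pd.DataFrame({
    "Age": rng.integers(18, 71, size=n),
    "Season": rng.choice(["Spring", "Summer", "Fall", "Winter"], size=n),
})
target = (toy["Age"] > 44).astype(int)  # toy label, not Subscription Status

preprocess = ColumnTransformer([
    ("num", StandardScaler(), ["Age"]),
    # Categories unseen during fit are mapped to -1 instead of raising an error
    ("cat", OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1), ["Season"]),
])
pipeline = Pipeline([
    ("preprocess", preprocess),
    ("model", RandomForestClassifier(n_estimators=100, random_state=42)),
])

X_tr, X_te, y_tr, y_te = train_test_split(
    toy, target, test_size=0.2, random_state=42, stratify=target
)
pipeline.fit(X_tr, y_tr)  # the scaler and encoder see only the training rows
score = pipeline.score(X_te, y_te)
print(score)
```

For a tree-based model the scaling step has little effect on predictions, so the practical benefit of this layout is mainly for models such as MLPClassifier that are sensitive to feature scale.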

In [62]:

from sklearn import preprocessing

label_encoder = preprocessing.LabelEncoder()

# X was already encoded above, so this only updates the copy of the column in df
df['Discount Applied'] = label_encoder.fit_transform(df['Discount Applied'])

In [63]:

df

	Customer ID	Age	Gender	Item Purchased	Category	Purchase Amount (USD)	Location	Size	Color	Season	...	Subscription Status	Payment Method	Shipping Type	Discount Applied	Promo Code Used	Previous Purchases	Preferred Payment Method	Frequency of Purchases	Age_Group	Amount_Group
0	1	55	1	2	1	53	16	0	7	3	...	1	2	1	1	1	14	5	3	老年	高消费
1	2	19	1	23	1	64	18	0	12	3	...	1	0	1	1	1	2	1	3	青年	高消费
2	3	50	1	11	1	73	20	2	12	1	...	1	1	2	1	1	23	2	6	老年	高消费
3	4	21	1	14	2	90	38	1	12	1	...	1	4	3	1	1	49	4	6	青年	高消费
4	5	45	1	2	1	49	36	1	21	1	...	1	1	2	1	1	31	4	0	老年	低消费
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
3895	3896	40	0	9	1	28	45	0	21	2	...	0	1	0	0	0	32	5	6	中年	低消费
3896	3897	52	0	0	0	49	14	0	23	1	...	0	4	5	0	0	41	0	1	老年	低消费
3897	3898	46	0	1	0	33	29	0	8	1	...	0	2	4	0	0	24	5	5	老年	低消费
3898	3899	44	0	17	2	77	22	2	3	2	...	0	4	1	0	0	24	5	6	中年	高消费
3899	3900	52	0	7	0	81	4	1	0	1	...	0	0	5	0	0	33	5	5	老年	高消费

3900 rows × 21 columns

Model training

In [64]:

model_RF = RandomForestClassifier(random_state=42, n_estimators=100)
model_RF.fit(X_train, y_train)

y_pred = model_RF.predict(X_test)

Model evaluation

In [65]:

print("Accuracy Score:", accuracy_score(y_test, y_pred))
print("\nClassification Report:\n", classification_report(y_test, y_pred))

Accuracy Score: 0.8602564102564103

Classification Report:
               precision    recall  f1-score   support

           0       0.95      0.85      0.90       569
           1       0.69      0.89      0.78       211

    accuracy                           0.86       780
   macro avg       0.82      0.87      0.84       780
weighted avg       0.88      0.86      0.87       780

In [66]:

# Visualize the confusion matrix
plt.figure(figsize=(6, 4))
sns.heatmap(confusion_matrix(y_test, y_pred), annot=True, fmt='d', cmap='Blues')
plt.title("Confusion Matrix")
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.show()
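The precision and recall values in the classification report can be derived from the four cells of a binary confusion matrix. A minimal sketch with hypothetical labels (not the notebook's predictions) that derives them by hand and checks the result against scikit-learn:

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical true labels and predictions
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 0, 1, 0])
y_hat = np.array([0, 0, 1, 0, 1, 1, 0, 0, 1, 1])

# For binary labels, ravel() returns the cells in the order tn, fp, fn, tp
cm = confusion_matrix(y_true, y_hat)
tn, fp, fn, tp = cm.ravel()

precision_1 = tp / (tp + fp)  # of the rows predicted as 1, how many are truly 1
recall_1 = tp / (tp + fn)     # of the rows that are truly 1, how many were found
print(precision_1, recall_1)
```

In the heatmap layout used above, rows are the actual class and columns the predicted class, so fp sits in the top-right cell and fn in the bottom-left.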

Feature importance ranking

In [67]:

importances = model_RF.feature_importances_
features = X.columns

# Pair each importance with its feature name and sort in descending order of importance
feature_importance_data = sorted(zip(importances, features), reverse=True)
importances_sorted, features_sorted = zip(*feature_importance_data)

plt.figure(figsize=(12, 8))
# Draw the bar chart in descending order of importance
sns.barplot(x=importances_sorted, y=features_sorted, palette='viridis')

# Mean importance across all features
avg_importance = np.mean(importances_sorted)

# Mark the mean importance with a red dashed vertical line
plt.axvline(x=avg_importance, color='r', linestyle='--', label=f'avg_importance={avg_importance:.2f}')

plt.title("Feature Importance")
plt.xlabel("Importance Score")
plt.ylabel("Features")
# Add the legend
plt.legend(fontsize='medium')
plt.show()

In [68]:

higher_than_avg_features = [(feature, importance) for importance, feature in zip(importances_sorted, features_sorted) if importance > avg_importance]
print("Features with above-average importance:")
for feature, importance in higher_than_avg_features:
    print(f"{feature}({importance:.3f})")

Features with above-average importance:
Promo Code Used(0.228)
Discount Applied(0.203)
Purchase Amount (USD)(0.054)
Previous Purchases(0.054)
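The feature_importances_ attribute reports impurity-based importances computed on the training data. Permutation importance on held-out data is a common cross-check: it measures how much the score drops when one feature's values are shuffled. The sketch below uses hypothetical synthetic data in which only the first column carries signal:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
# Hypothetical data: column 0 determines the label, column 1 is pure noise
features = rng.normal(size=(400, 2))
labels = (features[:, 0] > 0).astype(int)

f_train, f_test, l_train, l_test = train_test_split(
    features, labels, test_size=0.25, random_state=0
)
forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(f_train, l_train)

# Shuffle each column 10 times on the test set and record the mean accuracy drop
result = permutation_importance(forest, f_test, l_test, n_repeats=10, random_state=0)
print(result.importances_mean)
```

Applied to the notebook's model, the call would take model_RF, X_test, and y_test. Whether the ranking would match the impurity-based one above is not something this sketch can tell.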

Neural network

Model training

In [69]:

from sklearn.neural_network import MLPClassifier

In [70]:

# No random_state is set, so the results can vary between runs
model_bp = MLPClassifier(hidden_layer_sizes=(5, 3))
model_bp.fit(X_train, y_train)

y_pred = model_bp.predict(X_test)
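MLPClassifier initializes its weights randomly, so without a fixed random_state two runs of the cell above can give different scores. A minimal sketch of making the fit repeatable, on hypothetical synthetic data (the seed value, max_iter=1000, and the helper name fit_predict are assumptions):

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(1)
# Hypothetical data: the label depends on the sum of the first two columns
points = rng.normal(size=(300, 4))
classes = (points[:, 0] + points[:, 1] > 0).astype(int)

def fit_predict(seed: int) -> np.ndarray:
    """Train the same small network with a fixed seed and return its predictions."""
    clf = MLPClassifier(hidden_layer_sizes=(5, 3), max_iter=1000, random_state=seed)
    clf.fit(points, classes)
    return clf.predict(points)

first = fit_predict(42)
second = fit_predict(42)
print(np.array_equal(first, second))
```

Fixing the seed makes a single comparison against the random forest repeatable, but it does not show how much the score varies across seeds.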

Model evaluation

In [71]:

print("Accuracy Score:", accuracy_score(y_test, y_pred))
print("\nClassification Report:\n", classification_report(y_test, y_pred))

Accuracy Score: 0.8525641025641025

Classification Report:
               precision    recall  f1-score   support

           0       0.97      0.83      0.89       569
           1       0.66      0.92      0.77       211

    accuracy                           0.85       780
   macro avg       0.81      0.87      0.83       780
weighted avg       0.88      0.85      0.86       780

In [72]:

# Visualize the confusion matrix
plt.figure(figsize=(6, 4))
sns.heatmap(confusion_matrix(y_test, y_pred), annot=True, fmt='d', cmap='Blues')
plt.title("Confusion Matrix")
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.show()

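Comparison of the machine learning models

Detailed analysis:

Accuracy: the random forest (86.0%) and the neural network (85.3%) are very close.
Class 1: the neural network has higher recall (92%) than the random forest (89%), but both models have fairly low precision on this class: 66% for the neural network and 69% for the random forest.
Class 0: both models have high precision and recall. The neural network reaches 97% precision and 83% recall, and the random forest 95% and 85%, so the difference between them is small.

In summary, with accuracy and class 0 performance nearly the same, the neural network identifies class 1 slightly better in terms of recall.

However, the neural network's advantage on class 1 is small. Given that the random forest is more interpretable (for example, it provides the feature importance ranking), this advantage can be disregarded, so overall the random forest is the better choice of the two models.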
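The class 1 trade-off between precision and recall can be summarized with the F1 score, the harmonic mean of the two. As a worked check, the sketch below recomputes F1 from the rounded class 1 precision and recall printed in the two classification reports (0.69 and 0.89 for the random forest, 0.66 and 0.92 for the neural network). Because the inputs are already rounded, the results are approximate:

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Class 1 precision and recall as printed in the classification reports
rf_f1 = f1(0.69, 0.89)
nn_f1 = f1(0.66, 0.92)
print(round(rf_f1, 2), round(nn_f1, 2))  # prints 0.78 0.77
```

These values agree with the f1-score column of the two reports, and they show that the neural network's recall gain on class 1 is offset by its lower precision.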
