Background of the Requirement
Technical Background
In 2019, federated learning began to gain traction in China as an emerging technology for data collaboration. I studied it and applied it to JD's Data Federation and Mu Media Data Platform.
Traditional Data Integration: Feature or label data is consolidated at one party, and a model is trained on the combined data from both parties. This risks leaking private data and losing control of data assets.
Federated Learning: Data owners train jointly by exchanging encrypted training parameters, without disclosing their original data, and obtain a model whose accuracy is only slightly below that of a model trained on traditionally integrated data. The training target is either non-individual information or is authorized by users, and no party can infer another party's original data.
Vertical Federated Learning
Data situation of each party: High ID overlap, low feature overlap.
Compliance constraints: Feature X is classified as private data or trade secrets and cannot be exported; the process of predicting Y' is either authorized by users or Y' is not classified as private.
Use case: Party A holds features, and Party B holds labeled training samples (Y) plus some feature dimensions of its own. The goal is to improve Party B's prediction model without exporting Party A's data.
Solution: Expand the feature dimensions through vertical federated learning to help Party B predict Y'.
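To make the vertical setup concrete, here is a minimal sketch of a split logistic regression, assuming the two parties have already aligned their rows on shared IDs. Party A and Party B each keep their own feature columns and weights; only per-sample partial scores and residuals cross the boundary. The class and function names (`Party`, `train`, `demo`) and the synthetic data are illustrative assumptions, not the platform's implementation, and the exchanged values are sent in plaintext here, whereas a real deployment would encrypt them (for example with homomorphic encryption).

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


class Party:
    """One data owner: a private feature block plus the weights for it."""

    def __init__(self, features):
        self.x = features                  # rows follow the shared, pre-aligned ID order
        self.w = [0.0] * len(features[0])

    def partial_logits(self):
        # Only these per-sample sums leave the party, never the raw features.
        return [sum(w * v for w, v in zip(self.w, row)) for row in self.x]

    def apply_residuals(self, residuals, lr):
        n = len(self.x)
        for j in range(len(self.w)):
            grad = sum(r * row[j] for r, row in zip(residuals, self.x)) / n
            self.w[j] -= lr * grad


def train(party_a, party_b, labels, epochs=200, lr=0.5):
    for _ in range(epochs):
        z_a = party_a.partial_logits()     # A -> B (encrypted in a real system)
        z_b = party_b.partial_logits()
        # Party B owns the labels, so it computes the residuals.
        residuals = [sigmoid(a + b) - y for a, b, y in zip(z_a, z_b, labels)]
        party_a.apply_residuals(residuals, lr)   # B -> A (encrypted in a real system)
        party_b.apply_residuals(residuals, lr)


def accuracy(party_a, party_b, labels):
    z = [a + b for a, b in zip(party_a.partial_logits(), party_b.partial_logits())]
    hits = sum(1 for v, y in zip(z, labels) if (v > 0) == (y == 1))
    return hits / len(labels)


def demo(n=200, seed=0):
    rng = random.Random(seed)
    xa = [[rng.uniform(-1, 1), rng.uniform(-1, 1)] for _ in range(n)]  # Party A features
    xb = [[rng.uniform(-1, 1), 1.0] for _ in range(n)]                 # Party B feature + bias
    labels = [1 if a[0] + b[0] > 0 else 0 for a, b in zip(xa, xb)]     # Party B's Y

    solo_a, solo_b = Party([[0.0]] * n), Party(xb)   # baseline: A contributes nothing
    train(solo_a, solo_b, labels)
    joint_a, joint_b = Party(xa), Party(xb)
    train(joint_a, joint_b, labels)
    return accuracy(solo_a, solo_b, labels), accuracy(joint_a, joint_b, labels)
```

Calling `demo()` returns the training accuracy of Party B alone and of the joint model. Because the synthetic label depends on one feature from each side, the joint model should score higher, which is the effect the vertical scheme is meant to deliver.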
Horizontal Federated Learning
Data situation of each party: Low ID overlap, high feature overlap.
Compliance constraints: IDs must be anonymized and feature X cannot be exported, because feature X can itself identify individuals or leak trade secrets (for example individual trajectories, financial records, call records, store turnover, or rents).
Use case: Each party already has a prediction model, but with too few samples each model is undertrained and its parameters are suboptimal.
Solution: Increase training samples through horizontal federated learning to optimize each party's model parameters.
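The horizontal case can be sketched with federated averaging (FedAvg): every party trains the same model on its own samples, and only the resulting weights are combined, weighted by sample count. The sketch below uses a plain linear model and noiseless synthetic data; the function names, learning rate, and round counts are assumptions chosen for illustration, and the weight exchange is unencrypted here.

```python
import random


def local_update(weights, samples, lr=0.5, steps=5):
    """A few gradient steps on one party's private (x, y) samples."""
    w = list(weights)
    for _ in range(steps):
        grad = [0.0] * len(w)
        for x, y in samples:
            err = sum(wi * xi for wi, xi in zip(w, x)) - y
            for j, xj in enumerate(x):
                grad[j] += err * xj / len(samples)
        w = [wi - lr * g for wi, g in zip(w, grad)]
    return w


def federated_average(updates):
    """Average party weights, weighted by how many samples each party holds."""
    total = sum(n for _, n in updates)
    dim = len(updates[0][0])
    return [sum(w[j] * n for w, n in updates) / total for j in range(dim)]


def fed_avg(parties, dim, rounds=30):
    global_w = [0.0] * dim
    for _ in range(rounds):
        # Each party starts from the shared weights and returns only new weights.
        updates = [(local_update(global_w, s), len(s)) for s in parties]
        global_w = federated_average(updates)
    return global_w


def demo(seed=1):
    rng = random.Random(seed)
    true_w = [2.0, -1.0]
    parties = []
    for size in (20, 30, 50):              # three parties, different sample counts
        samples = []
        for _ in range(size):
            x = [rng.uniform(-1, 1), rng.uniform(-1, 1)]
            y = sum(t * v for t, v in zip(true_w, x))
            samples.append((x, y))
        parties.append(samples)
    return fed_avg(parties, dim=2)
```

In this toy run, `demo()` should recover weights close to `[2.0, -1.0]` even though no party ever sees another party's samples.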
Business Pain Points
Challenge: Regulations prohibit the exchange of users' private information. How can two or more data collaborators exchange crowd ID packages and produce crowd counts and group portraits without leaking individual information?
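One known technique for counting the overlap between two ID packages without revealing the IDs is Diffie-Hellman style private set intersection: each party blinds hashed IDs with its own secret exponent, and only doubly blinded values are compared. The sketch below simulates both parties in one process. It is an illustration of the idea, not the scheme used in the project; the modulus `P`, the helper names, and the sample IDs are assumptions, and the 127-bit group is far too small for real use.

```python
import hashlib
import math
import random
import secrets

P = 2**127 - 1  # Mersenne prime used as a toy modulus; NOT a production-grade group


def hash_to_group(user_id):
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % (P - 1) + 1   # value in [1, P-1]


def new_secret():
    # An exponent coprime to P-1 keeps blinding one-to-one on the group.
    while True:
        k = secrets.randbelow(P - 3) + 2
        if math.gcd(k, P - 1) == 1:
            return k


def blind(values, secret):
    out = [pow(v, secret, P) for v in values]
    random.shuffle(out)   # hide which position belongs to which ID
    return out


def intersection_size(ids_a, ids_b):
    key_a, key_b = new_secret(), new_secret()
    a_once = blind([hash_to_group(i) for i in ids_a], key_a)   # sent A -> B
    b_once = blind([hash_to_group(i) for i in ids_b], key_b)   # sent B -> A
    a_twice = blind(a_once, key_b)                             # B re-blinds, sends back to A
    b_twice = blind(b_once, key_a)                             # A re-blinds locally
    # Equal IDs give equal doubly blinded values, so A learns only the count.
    return len(set(a_twice) & set(b_twice))
```

For example, `intersection_size({"u1", "u2", "u3", "u4"}, {"u3", "u4", "u5"})` returns 2, and neither side ever handles the other side's raw or singly hashed IDs.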
1. Poor support for estimating the TA (target audience) crowd at outlets
Problem description: Currently, TA crowd estimates are available only from individual data sources, and each source has defects: insufficient label dimensions, insufficient sample penetration in specific scenarios or cities, and data lag. The business side wants to estimate the distribution of target populations in key industries nationwide. Conclusion: No single data source can supply the required number of industry label dimensions or the overall sample size. In addition, legal and compliance restrictions prevent us from merging labels directly at the individual ID level.
Problem level: Important and urgent
Solution: Use federated learning to connect multiple data sources, enriching the available label dimensions and sample size at the POI level.
2. Inaccurate estimate of POI crowd at outlets
Problem description: The data source currently used to estimate the crowd at outlets suffers from sparse self-reported data and scenario restrictions, especially in travel and consumption scenarios. Mobile operator data has low positioning accuracy (more than 200 m), a data lag of 30 to 50 days, and low penetration in some cities. We have now connected to the mobile operator's crowd data and use our self-reported data to extrapolate the total number, which meets the demand for the time being.
Problem level: Important but not urgent
Solution: Connect multiple data sources to estimate the crowd at outlets more accurately.
3. Crowd and TA numbers unavailable for store POIs
Problem description: Currently, none of our own data, the major map providers, or SDK partners can provide reasonably accurate store data. XXX can provide sparse real store data; map providers provide POI scenario data; SDK partners can provide mall data; our own media hardware can provide accurate but limited store traffic data.
Problem level: Important but not urgent
Solution: Build a store traffic estimation model by combining multiple data sources.
4. Traditional joint modeling has defects
- Positive samples (Y) must be exported, which creates compliance risk or a loss of data assets.
- Each party can build a complete model and sell it externally, which results in a loss of model assets.
- When the upper layer combines models through ensemble learning, overall model performance is not optimal.
Requirement Scenarios
[P1] Joint TA Concentration
- Ad placement selection: Identify the workplaces, residences, and frequently visited places of target customer groups in key industries, and provide comparative crowd concentration values so that offline ads can be placed where the target audience is most concentrated.
- Store location selection: Jointly use multiple data labels and crowd behavior records to measure the areas with the highest concentration of target populations.
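A comparative concentration value for the two scenarios above could be computed as an index against the overall baseline. The sketch below assumes one plausible definition (the TA share at a POI divided by the TA share across all POIs, times 100); the definition, the function name `concentration_index`, and the sample counts are illustrative assumptions rather than the project's actual metric.

```python
def concentration_index(poi_counts):
    """poi_counts maps a POI name to (target_audience_count, total_crowd_count).

    Returns an index per POI where 100 means "same TA share as the baseline".
    """
    total_ta = sum(ta for ta, _ in poi_counts.values())
    total_crowd = sum(crowd for _, crowd in poi_counts.values())
    baseline = total_ta / total_crowd          # TA share across all POIs
    return {
        poi: round((ta / crowd) / baseline * 100, 1)
        for poi, (ta, crowd) in poi_counts.items()
        if crowd > 0                           # skip POIs with no observed crowd
    }


index = concentration_index({
    "office_tower": (300, 1000),
    "residential_block": (100, 1000),
    "mall": (200, 2000),
})
ranked = sorted(index, key=index.get, reverse=True)
```

With these sample counts the office tower scores 200.0 and ranks first, meaning its target audience share is twice the baseline. In the federated setting the per-POI counts would be aggregated outputs, not individual-level records.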
[P0] Advertising Marketing - Cross-Scenario Effect Measurement
- Cognition → Attraction: The proportion of people exposed to offline advertisements who are driven to online malls; the proportion of people exposed to offline/online advertisements who are driven to shopping malls or stores; the proportion of people exposed to offline advertisements who engage in scanning behavior.
- Attraction → Action: Measure the number of people driven to shopping malls or stores who make a transaction/payment.
- Action → Loyalty: The number of people driven to shopping malls or stores who make multiple transactions or participate in brand topics online.
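The stage-to-stage measurements above reduce to intersecting the audience that reached one stage with the audience observed at the next. A minimal sketch follows, assuming each stage is available as a set of pseudonymous IDs; in practice the intersections would be computed across parties without sharing raw IDs. The function name and sample IDs are hypothetical.

```python
def funnel_rates(stages):
    """stages is an ordered list of (stage_name, set_of_ids).

    Returns (transition_label, conversion_rate) pairs, where each rate is
    measured only among people who passed every earlier stage.
    """
    rates = []
    reached = stages[0][1]
    for (prev_name, _), (name, ids) in zip(stages, stages[1:]):
        converted = reached & ids
        rate = len(converted) / len(reached) if reached else 0.0
        rates.append((f"{prev_name} -> {name}", rate))
        reached = converted
    return rates


rates = funnel_rates([
    ("Cognition", set(range(1, 11))),      # exposed to the offline ad
    ("Attraction", {1, 2, 3, 4, 11}),      # visited the mall or store
    ("Action", {1, 2, 12}),                # made a transaction
    ("Loyalty", {1}),                      # made repeat transactions
])
```

Here 4 of the 10 exposed people visit (0.4), 2 of those 4 buy (0.5), and 1 of those 2 buys again (0.5). IDs that appear at a later stage without an earlier exposure, such as 11 and 12, are not attributed to the ad.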
Work Content
- Compliance scheme design: Design and formulate a federated learning collaboration scheme; report the innovation project to the Legal and Compliance Department and the Government Cooperation Department, and explore compliant ways to carry it out.
- Lead the Data Federation organization: Negotiate and connect with external mobile operators, map providers, and SDK companies; internally, coordinate business departments, technical platforms, risk control centers, data assets, legal compliance, government cooperation, and other teams to advance data cooperation, product development, and model implementation.
- Coordinate the development of the Federation platform: Organize various internal technical platform departments, combine product, algorithm, and data capabilities, and promote the development of a federated learning platform with the company's own intellectual property rights.
- Model implementation: Design and promote the federated data model scheme for advertising marketing.
Project Results
- Launched the Data Federation's first federated project, achieving three-party joint training and inference for POI crowd prediction;
- The daily POI crowd prediction P30 indicator reached 90%, serving six internal and external systems;
- Saved over 100 million yuan in data purchase expenses;
- Won the China Academy of Information and Communications Technology & China Communications Standards Association 2020 Data Asset Management Conference - Outstanding Case Award for Privacy Computing
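The P30 indicator in the results above is not defined in this write-up. One common reading, assumed here, is the share of predictions whose relative error is within 30% of the observed value. Under that assumption it could be computed as in this sketch; the function name `p30` and the sample numbers are illustrative only.

```python
def p30(predicted, actual, tolerance=0.30):
    """Share of predictions within `tolerance` relative error of the observed value.

    Pairs with a non-positive observed value are skipped because relative
    error is undefined for them.
    """
    pairs = [(p, a) for p, a in zip(predicted, actual) if a > 0]
    if not pairs:
        raise ValueError("no valid observations")
    hits = sum(1 for p, a in pairs if abs(p - a) / a <= tolerance)
    return hits / len(pairs)


score = p30(predicted=[100, 125, 50, 200], actual=[110, 100, 100, 190])
```

In this example three of the four predictions fall within 30% of the observed crowd, so `score` is 0.75.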