SWATH-MS data processing in OpenSWATH
The SWATH-MS data was analyzed using OpenSWATH (Röst et al., 2014) with the following parameters: Chromatograms were extracted with 0.05 Th around the expected mass of the fragment ions and with an extraction window of ± 5 min around the expected retention time (see Data S3C for justification). The best models to separate true from false positives (per run) were determined by pyProphet with 10 cross-validations. The runs were subsequently aligned with a target FDR of 0.01 for aligned features (Röst et al., 2016). Background signals were extracted for features that could not be confidently identified (Röst et al., 2016). To reduce the size of the output data and remove low-quality features, two filtering steps were introduced: (i) keep only the 10 most intense peptide features per protein and (ii) of these, keep only features that were identified with an FDR < 0.01 in at least four samples over all runs, corresponding to the smallest tumor group in the dataset defined by a combination of subtype and lymph node status.
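The two filtering steps can be illustrated with a minimal Python sketch. It is not the code used in the study: the record layout (keys `protein`, `feature`, `mean_intensity`, `fdr_by_sample`) is hypothetical, and ranking features by their mean intensity across runs is an assumption, since the text does not state how intensity was summarized.

```python
from collections import defaultdict


def filter_features(features, top_n=10, fdr_cutoff=0.01, min_samples=4):
    """Apply the two filtering steps to a list of feature records.

    Each record is a dict with hypothetical keys:
      'protein'        protein identifier
      'feature'        peptide feature identifier
      'mean_intensity' summary intensity used for ranking (assumption)
      'fdr_by_sample'  dict mapping sample name -> FDR of the identification
    """
    by_protein = defaultdict(list)
    for record in features:
        by_protein[record["protein"]].append(record)

    kept = []
    for records in by_protein.values():
        # Step (i): keep only the top_n most intense features per protein.
        top = sorted(records, key=lambda r: r["mean_intensity"], reverse=True)[:top_n]
        # Step (ii): of these, keep features identified below the FDR cutoff
        # in at least min_samples samples.
        for record in top:
            confident = sum(1 for q in record["fdr_by_sample"].values() if q < fdr_cutoff)
            if confident >= min_samples:
                kept.append(record)
    return kept
```

Because step (ii) is applied only to the features that survive step (i), a protein can end up with fewer than 10 features even if it has other, less intense features that were confidently identified.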
All statistical tests were two-tailed and the results were considered statistically significant at alpha = 0.05 or FDR = 0.05, if not stated otherwise. Definition of error bars in all figures: Boxes are extended from the 25th to the 75th percentile, with a line at the median. The whiskers extend to the most extreme data point which is no more than 1.5 times the interquartile range (IQR) from the box. The individual points represent outliers or extreme values.
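The box and whisker definition above corresponds to the following calculation, shown here as a small Python sketch. The quartile interpolation rule (`method="inclusive"` in the standard library) is an assumption; plotting software may use a slightly different rule, and the function name `box_stats` is illustrative only.

```python
import statistics


def box_stats(values):
    """Return box, whisker and outlier values for one group of data points."""
    data = sorted(values)
    # Box: 25th percentile, median, 75th percentile.
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    # Whiskers: most extreme data points still within 1.5 x IQR of the box.
    inside = [v for v in data if low_fence <= v <= high_fence]
    # Individual points: everything beyond the fences.
    outliers = [v for v in data if v < low_fence or v > high_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "whisker_low": inside[0],
        "whisker_high": inside[-1],
        "outliers": outliers,
    }
```

Note that the whiskers end at actual data points, not at the fences themselves, which is why they can be shorter than 1.5 times the IQR.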
Relative quantification with MSstats and differential protein analysis between subtypes and related clinical-pathological variables
We used the R (version 3.0.3) package MSstats 2.1.3 (Choi et al., 2014) for relative quantification of protein levels among the five different breast cancer conventional subtypes and related clinical-pathological variables (ER, grade, HER2, lymph node status). Before MSstats and correlation analysis, the OpenSWATH output was further reduced to contain up to five peptide features per protein and the intensities were log2 transformed and median-equalized. The differences in protein expression between conventional subtypes and related clinical-pathological variables were compared pairwise using mixed effect models as implemented in the groupComparison function of MSstats, with expanded scope of biological and restricted scope of technical replication. Resulting p values were corrected for multiple hypothesis testing by the Benjamini-Hochberg method.
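Two of the steps above, median equalization of log2 intensities and the Benjamini-Hochberg correction, can be sketched in a few lines of Python. This is an illustration of the general techniques, not the MSstats implementation: the input layout and function names are hypothetical, and shifting every run to the median of the run medians is an assumption about the reference level.

```python
import math
import statistics


def log2_median_equalize(runs):
    """runs: dict mapping run name -> dict of feature -> raw intensity (> 0).

    Returns log2 intensities shifted so that all runs share the same median.
    """
    logged = {run: {f: math.log2(v) for f, v in feats.items()}
              for run, feats in runs.items()}
    medians = {run: statistics.median(vals.values()) for run, vals in logged.items()}
    # Common reference level (assumption): median of the per-run medians.
    target = statistics.median(medians.values())
    return {run: {f: v - medians[run] + target for f, v in vals.items()}
            for run, vals in logged.items()}


def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted
```

A protein would then be called significant when its adjusted p value falls below the FDR threshold, for example 0.05.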
KEGG pathway analysis
The list of 4,443 proteins in the SWATH-MS library of assays (Data S2A) and the list of 2,842 proteins quantified by SWATH-MS (Data S3) were submitted to the Kyoto Encyclopedia of Genes and Genomes (KEGG) Mapper (https://www.kegg.jp/kegg/tool/map_pathway2.html) and searched against the hsa (Homo sapiens) database, and the subset of proteins related to Pathways in cancer (hsa05200) was displayed.
Gene set enrichment analysis
Gene set enrichment analysis (GSEA) in the GSEA Java desktop application (http://software.broadinstitute.org/gsea/login.jsp) was conducted using the pre-ranked list (according to protein fold changes between ER+/ER-, tumor grade 3/grade 1, HER2+/HER2-, lymph node positive/negative patient groups) of 2,842 proteins quantified by SWATH-MS to find pathways enriched in ER+, high grade, HER2+, and lymph node positive phenotypes separately, with a priori defined pathways from BioCarta (https://cgap.nci.nih.gov/Pathways/BioCarta_Pathways). We used default settings, except that we decreased the minimal size of a gene set to 1 and did not use any normalization method to normalize the enrichment scores across analyzed gene sets.
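For orientation, the running-sum enrichment score that underlies a pre-ranked GSEA can be sketched as follows. This is a simplified Python illustration, not the GSEA desktop application: the function name is hypothetical, `weight=1.0` is assumed to correspond to the default weighted scoring scheme, and the permutation-based significance estimate is omitted. The score is returned unnormalized, in line with the setting described above.

```python
def enrichment_score(ranked, gene_set, weight=1.0):
    """Unnormalized enrichment score of gene_set in a pre-ranked list.

    ranked:   list of (gene, score) pairs, e.g. protein fold changes
    gene_set: set of gene identifiers belonging to one pathway
    """
    ranked = sorted(ranked, key=lambda pair: pair[1], reverse=True)
    hits = [gene in gene_set for gene, _ in ranked]
    n_miss = len(ranked) - sum(hits)
    hit_norm = sum(abs(score) ** weight
                   for (_, score), hit in zip(ranked, hits) if hit)
    if hit_norm == 0 or n_miss == 0:
        raise ValueError("gene set must overlap the ranked list only partially")

    running = 0.0
    best = 0.0
    for (_, score), hit in zip(ranked, hits):
        if hit:
            # Step up in proportion to the (weighted) ranking score.
            running += abs(score) ** weight / hit_norm
        else:
            # Step down by a constant amount for every non-member.
            running -= 1.0 / n_miss
        # Keep the maximum deviation from zero, with its sign.
        if abs(running) > abs(best):
            best = running
    return best
```

A positive score indicates that the members of the gene set cluster at the top of the ranked list (for example, higher in ER+), and a negative score indicates clustering at the bottom.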
Correlation analysis of breast cancer tissue proteomes
For the correlation analysis of the pooled samples, label-free quantification was conducted using the R package aLFQ (1.3.2) (Rosenberger et al., 2014). The method ProteinInference with default parameters (summing the three most intense transitions per peptide and averaging the two most intense peptides per protein) but without consensus feature selection was used to compute a protein intensity for all 1,832 proteins for which at least one peptide had been quantified by OpenSWATH (only including proteotypic peptides). Hierarchical clustering with Spearman’s correlation-based distance matrix and average linkage algorithm was performed in Perseus 188.8.131.52 software (https://www.maxquant.org) on log2 transformed, Z-score normalized (on both samples and proteins according to median) protein abundance values, including only proteins quantified in all pools. For correlation analysis of individual samples, we selected all 2,842 proteins for which proteotypic peptides were quantified by OpenSWATH and performed Spearman’s correlation among samples based on log2 protein intensities.
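The summarization rule described for ProteinInference (sum of the three most intense transitions per peptide, mean of the two most intense peptides per protein) can be illustrated with a short Python sketch. It mirrors only the rule as stated in the text and is not the aLFQ code; the nested-dictionary input and the function name are hypothetical.

```python
def protein_intensity(transitions, top_transitions=3, top_peptides=2):
    """Summarize transition intensities to one intensity per protein.

    transitions: dict mapping protein -> dict of peptide -> list of
                 transition intensities measured in one sample
    """
    result = {}
    for protein, peptides in transitions.items():
        peptide_sums = []
        for intensities in peptides.values():
            # Peptide level: sum the most intense transitions.
            top = sorted(intensities, reverse=True)[:top_transitions]
            peptide_sums.append(sum(top))
        # Protein level: average the most intense peptides.
        best = sorted(peptide_sums, reverse=True)[:top_peptides]
        if best:
            result[protein] = sum(best) / len(best)
    return result
```

Proteins with a single quantified peptide are kept and simply take that peptide's summed intensity, consistent with the requirement of at least one quantified peptide per protein.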