Data Process
How we collect and process publication data.
Discovery
New articles are discovered through the CrossRef API, which provides ground-truth metadata: DOIs, titles, publication dates, author lists, and citation counts. As journals publish new issues, articles are automatically queued for crawling.
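The discovery step can be sketched as a query against the CrossRef REST works endpoint, filtered to articles created since the last crawl. This is a minimal illustration: the ISSN, date handling, and page size are assumptions, not our exact configuration.

```python
# Sketch: build a CrossRef works query for one journal, returning the
# newest records first. Pagination and retry handling are omitted.
def crossref_query_url(issn: str, since: str, rows: int = 100) -> str:
    """Query articles created on or after `since` (YYYY-MM-DD)."""
    base = f"https://api.crossref.org/journals/{issn}/works"
    return (
        f"{base}?filter=from-created-date:{since}"
        f"&sort=created&order=desc&rows={rows}"
    )
```

Each DOI returned by a query like this is compared against the database, and unseen articles are queued for crawling.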
Collection
Publisher-specific crawlers visit each article page to extract rich metadata that CrossRef doesn't provide: author affiliations, keywords, editorial information (senior editors, associate editors), abstracts, and PDF links. Crawlers rotate politely across publishers to respect rate limits and avoid overloading any single site.
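The polite rotation can be pictured as interleaving per-publisher queues so that consecutive requests never hammer the same site. The sketch below shows only the ordering logic; real crawlers would also enforce per-publisher delays and honor robots.txt.

```python
from collections import deque

# Sketch: round-robin across publisher queues so consecutive fetches
# hit different sites. Input shape is illustrative.
def rotate(queues: dict[str, list[str]]) -> list[str]:
    """Return URLs interleaved one-per-publisher per pass."""
    pending = {pub: deque(urls) for pub, urls in queues.items() if urls}
    order = []
    while pending:
        for pub in list(pending):
            order.append(pending[pub].popleft())
            if not pending[pub]:
                del pending[pub]  # this publisher's queue is drained
    return order
```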
Quality Assurance
Every article passes through automated validation that checks for required fields, data consistency, and extraction quality. Self-healing crawlers detect when publisher sites change and adapt accordingly. Issues are tracked and auto-resolved to maintain data accuracy over time.
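A validation pass of this kind reduces to a function that returns a list of problems for each record. The field names and the DOI check below are illustrative assumptions, not our full rule set.

```python
# Sketch: minimal per-article validation. An empty result means the
# record passes; otherwise each string names one problem.
REQUIRED = ("doi", "title", "authors", "published_date")

def validate(article: dict) -> list[str]:
    problems = [f"missing {f}" for f in REQUIRED if not article.get(f)]
    # DOIs registered with CrossRef start with the "10." prefix.
    if article.get("doi") and not article["doi"].startswith("10."):
        problems.append("malformed DOI")
    return problems
```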
Affiliation Processing
Raw affiliation text is parsed into structured fields (institution, department, school, city, country) using large language models. For journals that don't display affiliations on article pages, missing data is enriched via ORCID employment records and cross-journal author lookups.
Institutions are matched to the ROR (Research Organization Registry) for canonical identification, and business schools are matched to AACSB accreditation records. Duplicate organizations are detected through normalized and fuzzy name matching, confirmed by an LLM judge, then merged into a single canonical record.
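The normalized-plus-fuzzy matching step can be sketched with standard string tooling. The stopword list and the 0.9 similarity threshold below are illustrative assumptions; in the real pipeline, borderline pairs go to an LLM judge rather than being merged automatically.

```python
import re
from difflib import SequenceMatcher

# Sketch: normalize organization names, then fuzzy-compare.
def normalize(name: str) -> str:
    name = name.lower()
    name = re.sub(r"\b(the|of|at)\b", " ", name)  # drop filler words
    return re.sub(r"[^a-z0-9]+", " ", name).strip()

def likely_duplicates(a: str, b: str, threshold: float = 0.9) -> bool:
    na, nb = normalize(a), normalize(b)
    return na == nb or SequenceMatcher(None, na, nb).ratio() >= threshold
```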
Author Deduplication
The same researcher often appears under different name variations or across multiple institutions. A multi-signal scoring system merges duplicate records into canonical author entries using ORCID identifiers, email addresses, name similarity, organizational overlap, and co-author network signals.
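A multi-signal scorer of this kind combines the signals into one merge score. The weights below are illustrative assumptions, not our production values; the one firm rule shown is that a shared ORCID is treated as decisive.

```python
# Sketch: score how likely two author records are the same person.
# Field names and weights are assumptions.
def match_score(a: dict, b: dict) -> float:
    if a.get("orcid") and a.get("orcid") == b.get("orcid"):
        return 1.0  # identical ORCID iDs settle it outright
    score = 0.0
    if a.get("email") and a.get("email") == b.get("email"):
        score += 0.5
    if a.get("name", "").lower() == b.get("name", "").lower():
        score += 0.2
    if set(a.get("orgs", [])) & set(b.get("orgs", [])):
        score += 0.2  # organizational overlap
    if set(a.get("coauthors", [])) & set(b.get("coauthors", [])):
        score += 0.2  # shared co-author network
    return min(score, 1.0)
```

Pairs scoring above a merge threshold are collapsed into a single canonical author entry.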
Analytics
The processed data feeds a set of downstream analytics. For example, LLM-based keyword extraction generates standardized keywords from article abstracts, enabling trend analysis across journals and time periods.
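Once keywords are standardized, trend analysis is straightforward aggregation. The sketch below assumes a simple record shape (year plus keyword list) purely for illustration.

```python
from collections import Counter

# Sketch: count articles mentioning a keyword, grouped by year.
def keyword_trend(articles: list[dict], keyword: str) -> Counter:
    return Counter(
        a["year"] for a in articles if keyword in a.get("keywords", [])
    )
```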
Data Issues
Data issues can arise from many sources: publisher websites may display incorrect information, crawlers may extract the wrong fields, and LLM parsing may misinterpret affiliations or other metadata. Users can report issues through our system, which automatically creates a public GitHub issue. AI agents then triage reports, diagnose root causes, apply fixes, validate the results, and comment with a summary of actions taken.
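The report-to-issue step amounts to building a payload for GitHub's create-issue endpoint (POST /repos/{owner}/{repo}/issues). The title format and label below are illustrative assumptions about how a report might be packaged.

```python
import json

# Sketch: turn a user report into a GitHub issue payload.
def issue_payload(doi: str, field: str, description: str) -> str:
    return json.dumps({
        "title": f"Data issue: {field} for {doi}",
        "body": description,
        "labels": ["data-issue"],  # label name is an assumption
    })
```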
Notice something wrong? Report an issue to help improve the data.