Data Process

How we collect and process publication data.

Discovery

New articles are discovered through the CrossRef API, which provides ground-truth metadata: DOIs, titles, publication dates, author lists, and citation counts. As journals publish new issues, articles are automatically queued for crawling.
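The discovery step can be sketched against the public CrossRef REST API. The endpoint shape is real (`api.crossref.org`), but the specific filters, row count, and field selection below are illustrative assumptions, not the production query:

```python
# Sketch of discovery via the CrossRef REST API. The filter values and
# cursor handling here are illustrative assumptions.
from urllib.parse import urlencode

CROSSREF_WORKS = "https://api.crossref.org/journals/{issn}/works"

def build_discovery_url(issn: str, from_date: str, rows: int = 100) -> str:
    """Build a query for journal articles published since `from_date` (YYYY-MM-DD)."""
    params = {
        "filter": f"from-pub-date:{from_date},type:journal-article",
        "rows": rows,
        "sort": "published",
        "order": "desc",
    }
    return CROSSREF_WORKS.format(issn=issn) + "?" + urlencode(params)

def extract_article(item: dict) -> dict:
    """Pull the ground-truth fields we keep from one CrossRef work record."""
    return {
        "doi": item["DOI"],
        "title": (item.get("title") or [""])[0],
        "authors": [f"{a.get('given', '')} {a.get('family', '')}".strip()
                    for a in item.get("author", [])],
        "cited_by": item.get("is-referenced-by-count", 0),
    }
```

Each discovered record carries its DOI, which serves as the stable key for every later stage of the pipeline.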

Collection

Publisher-specific crawlers visit each article page to extract rich metadata that CrossRef doesn't provide: author affiliations, keywords, editorial information (senior editors, associate editors), abstracts, and PDF links. Crawlers rotate politely across publishers to respect rate limits and avoid overloading any single site.
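The polite rotation can be sketched as a round-robin interleaving of per-publisher queues, so no single site ever sees back-to-back requests. The publisher names and the absence of explicit delays here are simplifying assumptions:

```python
# Sketch of polite crawl scheduling: interleave per-publisher queues
# round-robin so consecutive requests never hit the same site.
from collections import deque

def interleave_by_publisher(queues: dict[str, list[str]]) -> list[tuple[str, str]]:
    """Yield (publisher, url) pairs, rotating across publishers."""
    pending = {pub: deque(urls) for pub, urls in queues.items() if urls}
    order = deque(pending)
    schedule = []
    while order:
        pub = order.popleft()
        schedule.append((pub, pending[pub].popleft()))
        if pending[pub]:
            order.append(pub)  # publisher rejoins the back of the rotation
    return schedule
```

A production scheduler would additionally enforce a per-domain minimum delay between requests; the interleaving alone already spreads load evenly.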

Quality Assurance

Every article passes through automated validation that checks for required fields, data consistency, and extraction quality. Self-healing crawlers detect when publisher sites change and adapt accordingly. Issues are tracked and auto-resolved to maintain data accuracy over time.
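The required-field and consistency checks can be sketched as a validator that returns a list of issues per record. The field names and rules below are illustrative assumptions, not the full production rule set:

```python
# Sketch of automated validation: check required fields and basic
# consistency before a record enters the database. Field names and
# rules are illustrative assumptions.
REQUIRED = ("doi", "title", "journal", "published")

def validate(record: dict) -> list[str]:
    """Return human-readable issues; an empty list means the record passes."""
    issues = [f"missing field: {f}" for f in REQUIRED if not record.get(f)]
    if record.get("doi") and not record["doi"].startswith("10."):
        issues.append("doi does not look like a DOI")
    if record.get("authors") == []:
        issues.append("empty author list")
    return issues
```

Records that fail validation are flagged rather than dropped, so a crawler regression surfaces as a spike in tracked issues instead of silent data loss.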

Affiliation Processing

Raw affiliation text is parsed into structured fields (institution, department, school, city, country) using large language models. For journals that don't display affiliations on article pages, missing data is enriched via ORCID employment records and cross-journal author lookups.
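The structured output can be sketched as a small schema that the model's JSON response is validated against. The exact field names and prompt are assumptions; the point is that free-text affiliations come back as checked, typed records:

```python
# Sketch of the structured affiliation schema the LLM is asked to fill.
# Field names are assumptions; model output is validated before use.
import json
from dataclasses import dataclass
from typing import Optional

@dataclass
class Affiliation:
    institution: str
    department: Optional[str] = None
    school: Optional[str] = None
    city: Optional[str] = None
    country: Optional[str] = None

def parse_llm_affiliation(raw_json: str) -> Affiliation:
    """Validate the model's JSON output against the schema."""
    data = json.loads(raw_json)
    unknown = set(data) - set(Affiliation.__dataclass_fields__)
    if unknown:
        raise ValueError(f"unexpected fields from model: {unknown}")
    return Affiliation(**data)
```

Rejecting unknown fields keeps hallucinated keys from leaking into the database.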

Institutions are matched against the Research Organization Registry (ROR) for canonical identification, and business schools are matched to AACSB accreditation records. Duplicate organizations are detected through normalized and fuzzy name matching, confirmed by an LLM judge, then merged into a single canonical record.
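The normalized-plus-fuzzy matching step can be sketched with stdlib tools. The normalization rules and the 0.92 similarity threshold are illustrative assumptions; borderline pairs are the ones handed to the LLM judge:

```python
# Sketch of duplicate-organization detection: exact match on normalized
# names first, then fuzzy similarity. Threshold is an assumption.
import difflib
import re

def normalize(name: str) -> str:
    """Lowercase, strip punctuation, and expand a common abbreviation."""
    name = re.sub(r"[^\w\s]", "", name.lower())
    name = re.sub(r"\buniv\b", "university", name)
    return " ".join(name.split())

def likely_duplicates(a: str, b: str, threshold: float = 0.92) -> bool:
    na, nb = normalize(a), normalize(b)
    if na == nb:
        return True
    return difflib.SequenceMatcher(None, na, nb).ratio() >= threshold
```

High-confidence matches merge automatically; anything near the threshold is escalated rather than merged blindly.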

Author Deduplication

The same researcher often appears under different name variations or across multiple institutions. A multi-signal scoring system merges duplicate records into canonical author entries using ORCID identifiers, email addresses, name similarity, organizational overlap, and co-author network signals.
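The multi-signal scoring can be sketched as a weighted combination of the signals listed above. The weights and merge threshold here are illustrative assumptions, not the tuned production values:

```python
# Sketch of multi-signal author matching. Weights and threshold are
# illustrative assumptions.
def match_score(a: dict, b: dict) -> float:
    """Score two author records; higher means more likely the same person."""
    if a.get("orcid") and a.get("orcid") == b.get("orcid"):
        return 1.0  # a shared ORCID is decisive on its own
    score = 0.0
    if a.get("email") and a.get("email") == b.get("email"):
        score += 0.6
    if a.get("name", "").lower() == b.get("name", "").lower():
        score += 0.3
    if set(a.get("institutions", [])) & set(b.get("institutions", [])):
        score += 0.2
    if set(a.get("coauthors", [])) & set(b.get("coauthors", [])):
        score += 0.2
    return min(score, 1.0)

def should_merge(a: dict, b: dict, threshold: float = 0.5) -> bool:
    return match_score(a, b) >= threshold
```

Treating ORCID as decisive while letting weaker signals accumulate means a shared institution or co-author network can tip a same-name pair over the merge threshold.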

Analytics

We run several analyses on the processed data. For example, LLM-based keyword extraction generates standardized keywords from article abstracts, enabling trend analysis across journals and time periods.
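Once keywords are standardized, the trend analysis reduces to counting terms per year. A minimal sketch, with hypothetical keyword values:

```python
# Sketch of keyword trend counting: normalize extracted terms, then
# count occurrences per (year, keyword) pair.
from collections import Counter

def keyword_trends(articles: list[dict]) -> Counter:
    """Count (year, keyword) pairs across articles."""
    counts: Counter = Counter()
    for art in articles:
        for kw in art.get("keywords", []):
            counts[(art["year"], kw.strip().lower())] += 1
    return counts
```

Because the LLM step already maps synonyms to a shared vocabulary, a plain counter is enough to surface rising and falling topics.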

Data Issues

Data issues can arise from many sources: publisher websites may display incorrect information, crawlers may extract the wrong fields, and LLM parsing may misinterpret affiliations or other metadata. Users can report issues through our system, which automatically creates a public GitHub issue. AI agents then triage reports, diagnose root causes, apply fixes, validate the results, and comment with a summary of actions taken.
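The report-to-issue step can be sketched as building the request body for GitHub's create-an-issue endpoint (`POST /repos/{owner}/{repo}/issues`, a real API). The title format, body layout, and labels below are placeholders, not the real ones:

```python
# Sketch of turning a user report into a GitHub issue payload for
# POST /repos/{owner}/{repo}/issues. Labels and formatting are
# placeholder assumptions.
def issue_payload(doi: str, field: str, description: str) -> dict:
    """Build the request body for GitHub's 'create an issue' endpoint."""
    return {
        "title": f"Data issue: {field} for {doi}",
        "body": f"**DOI:** {doi}\n**Field:** {field}\n\n{description}",
        "labels": ["data-issue", "auto-filed"],
    }
```

Filing through a structured payload like this gives the triaging agents a predictable format to parse when diagnosing the report.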

Notice something wrong? Report an issue to help improve the data.