Developing Pretrained Language Models for Turkish Biomedical Domain

dc.authorscopusid: 57212949244
dc.authorscopusid: 6602384879
dc.authorscopusid: 24824386400
dc.authorscopusid: 57201307584
dc.authorscopusid: 7005245659
dc.contributor.author: Turkmen, Hazal
dc.contributor.author: Dikenelli, Oguz
dc.contributor.author: Eraslan, Cenk
dc.contributor.author: Calli, Mehmet Cem
dc.contributor.author: Ozbek, Suha Sureyya
dc.date.accessioned: 2023-01-12T20:01:22Z
dc.date.available: 2023-01-12T20:01:22Z
dc.date.issued: 2022
dc.department: N/A/Department [en_US]
dc.description: 10th IEEE International Conference on Healthcare Informatics (IEEE ICHI) -- JUN 11-14, 2022 -- Rochester, MN [en_US]
dc.description.abstract: Pretrained language models enhanced with in-domain corpora show impressive results in biomedical and clinical NLP tasks in English. However, there is minimal work in low-resource languages. This work introduces the BioBERTurk family, three pretrained models in Turkish for the biomedical domain. To evaluate the models, we also introduce a labeled dataset for classifying radiology reports of CT exams. Our first model was initialized from BERTurk and further pretrained on a biomedical corpus. The second model continues pretraining the general BERT model on a corpus of Ph.D. theses on radiology, to test the effect of task-related text. The final model combines the radiology and biomedical corpora with the corpus of BERTurk and pretrains a BERT model from scratch. The F-scores of our models on radiology report classification are 92.99, 92.75, and 89.49, respectively. To the best of our knowledge, this is the first work to evaluate the effect of a small in-domain corpus in pretraining from scratch. [en_US]
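The abstract describes two transfer strategies: continued (domain-adaptive) masked-language-model pretraining initialized from BERTurk, and pretraining from scratch on mixed corpora. For reference, below is a minimal sketch of the continued-pretraining step using the Hugging Face Transformers and Datasets libraries. It is not the authors' released code: the corpus file name and all hyperparameters are illustrative assumptions (BERTurk is published on the Hub as dbmdz/bert-base-turkish-cased).

    # Minimal sketch: continue MLM pretraining of BERTurk on a domain corpus.
    # Not the authors' code; file name and hyperparameters are assumptions.
    from datasets import load_dataset
    from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                              DataCollatorForLanguageModeling, Trainer,
                              TrainingArguments)

    checkpoint = "dbmdz/bert-base-turkish-cased"  # BERTurk
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModelForMaskedLM.from_pretrained(checkpoint)

    # Hypothetical plain-text biomedical corpus, one passage per line.
    corpus = load_dataset("text", data_files={"train": "biomedical_corpus.txt"})

    def tokenize(batch):
        return tokenizer(batch["text"], truncation=True, max_length=512)

    tokenized = corpus.map(tokenize, batched=True, remove_columns=["text"])

    # Standard BERT objective: mask 15% of tokens and predict them.
    collator = DataCollatorForLanguageModeling(tokenizer=tokenizer,
                                               mlm_probability=0.15)

    args = TrainingArguments(output_dir="bioberturk-continued",
                             per_device_train_batch_size=32,
                             num_train_epochs=3)  # illustrative settings

    Trainer(model=model, args=args, train_dataset=tokenized["train"],
            data_collator=collator).train()

The resulting checkpoint would then be fine-tuned on the labeled CT radiology-report dataset (e.g., via AutoModelForSequenceClassification) to obtain classification F-scores like those reported in the abstract.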
dc.description.sponsorship: IEEE; Mayo Clin, Dept AI & Informat; Mayo Clin, Robert D & Patricia E Kern Ctr Sci Hlth Care Delivery; Mayo Clin Platform; NSF; IEEE Comp Soc Tech Comm Intelligent Informat; Journal Healthcare Informat Res; Hlth Data Sci [en_US]
dc.description.sponsorship: TensorFlow Research Cloud (TRC) [en_US]
dc.description.sponsorship: We would like to acknowledge the support we received from the TensorFlow Research Cloud (TRC) team in providing access to TPUv3 units. [en_US]
dc.identifier.doi: 10.1109/ICHI54592.2022.00117
dc.identifier.endpage: 598 [en_US]
dc.identifier.isbn: 978-1-6654-6845-9
dc.identifier.issn: 2575-2634
dc.identifier.issn: 2575-2626
dc.identifier.scopus: 2-s2.0-85139013064 [en_US]
dc.identifier.scopusquality: N/A [en_US]
dc.identifier.startpage: 597 [en_US]
dc.identifier.uri: https://doi.org/10.1109/ICHI54592.2022.00117
dc.identifier.uri: https://hdl.handle.net/11454/77466
dc.identifier.wos: WOS:000864170400105 [en_US]
dc.identifier.wosquality: N/A [en_US]
dc.indekslendigikaynak: Web of Science [en_US]
dc.indekslendigikaynak: Scopus [en_US]
dc.language.iso: en [en_US]
dc.publisher: IEEE [en_US]
dc.relation.ispartof: 2022 IEEE 10th International Conference on Healthcare Informatics (ICHI 2022) [en_US]
dc.relation.publicationcategory: Conference Item - International - Institutional Faculty Member [en_US]
dc.rights: info:eu-repo/semantics/closedAccess [en_US]
dc.subject: biomedicine [en_US]
dc.subject: pretrained language model [en_US]
dc.subject: transformer [en_US]
dc.subject: transfer learning [en_US]
dc.subject: radiology reports [en_US]
dc.title: Developing Pretrained Language Models for Turkish Biomedical Domain [en_US]
dc.type: Conference Object [en_US]
