A corpus of 349k+ Nepali sentences for NLP research

In recent years, natural language processing (NLP) has become a major area of research. It spans a wide range of applications such as chatbots, sentiment analysis, and named entity recognition. As techniques in this field have matured, these tasks have found widespread implementation and progressively reduce the workload on human resources. The basic requirement for such data-driven techniques, however, is the data itself. Deep learning techniques in particular consume huge amounts of data to achieve adequate accuracy on their tasks. The pursuit of such data therefore becomes a great challenge and even a bottleneck for further research. Open-source corpora for NLP, though progressively more common in recent years, remain few and far between for ‘minority’ languages such as Nepali.

To address this issue, I am publishing this open-source corpus of almost 350k Nepali sentences, crawled from online news portals and passed through various cleaning steps. These steps include standardizing the punctuation used across the different news portals; removing links, symbols, and unneeded characters; removing the first and last lines of each article (these lines often have issues and are usually phrases rather than actual sentences); and randomly shuffling the resulting sentences. A sketch of this kind of pipeline is shown below. I hope this data helps drive further research and provides deeper insights into the Nepali language. I would like to thank Naya for their assistance in building this corpus.
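To make the cleaning steps concrete, here is a minimal Python sketch of a comparable pipeline: normalizing sentence-final punctuation to the Devanagari danda, stripping links and unneeded symbols, dropping the first and last lines of each article, and shuffling the pooled sentences. The regular expressions, normalization rules, and helper names (`clean_line`, `clean_article`, `build_corpus`) are illustrative assumptions, not the exact code used to build this corpus.

```python
import random
import re

URL_RE = re.compile(r"https?://\S+|www\.\S+")
# Keep Devanagari characters, whitespace, and basic punctuation;
# everything else is treated as an "unneeded" symbol in this sketch.
UNWANTED_RE = re.compile(r"[^\u0900-\u097F\s।,?!']")

def clean_line(line: str) -> str:
    """Normalize punctuation and strip links/symbols from one line."""
    line = URL_RE.sub(" ", line)
    # Standardize sentence-final punctuation: some portals use the pipe
    # or the full stop in place of the Devanagari danda.
    line = line.replace("|", "।").replace(".", "।")
    line = UNWANTED_RE.sub(" ", line)
    return re.sub(r"\s+", " ", line).strip()

def clean_article(lines: list[str]) -> list[str]:
    """Drop the first and last lines (often headlines or bylines rather
    than sentences), then clean and split the rest into sentences."""
    body = lines[1:-1] if len(lines) > 2 else []
    sentences = []
    for line in body:
        cleaned = clean_line(line)
        # Split on the danda so each corpus entry is a single sentence.
        for sent in cleaned.split("।"):
            sent = sent.strip()
            if sent:
                sentences.append(sent + " ।")
    return sentences

def build_corpus(articles: list[list[str]], seed: int = 42) -> list[str]:
    """Clean every article and shuffle the pooled sentences."""
    corpus = [s for article in articles for s in clean_article(article)]
    random.Random(seed).shuffle(corpus)
    return corpus

if __name__ == "__main__":
    demo = [["शीर्षक", "नेपाल एक सुन्दर देश हो। यहाँ धेरै हिमाल छन्।", "स्रोत: example.com"]]
    for sentence in build_corpus(demo):
        print(sentence)
```

The published corpus is a plain .txt file with one cleaned sentence per line, so a pipeline like this would simply write `build_corpus(...)` out line by line.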

Please find the dataset and other resources in this GitHub repo (.txt, 89.6 MB). DOI: 10.13140/RG.2.2.30708.68482 (ResearchGate)
