Japanese Text Analysis: Overview

A guide to tools, corpora, and other resources related to Japanese text analysis and natural language processing, with a focus on the digital humanities.

Introduction

Help and Resources

Morphological Analysis

Software:

  • Voyant Tools now works with Japanese (including tokenizing your text)
  • Topic Modeling Tool also works out-of-the-box with pre-tokenized Japanese
  • ctext.org's Text Tools works with pre-tokenized Japanese (uncheck "tokenize by character" in the options)
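
To illustrate why pre-tokenized text should not be tokenized by character, here is a minimal Python sketch contrasting the two approaches. It is not the implementation used by any of the tools above; the function names and the sample sentence are hypothetical and chosen only for illustration.

```python
def tokenize_by_character(text):
    """Treat every non-whitespace character as its own token."""
    return [ch for ch in text if not ch.isspace()]


def tokenize_pretokenized(text):
    """Split on whitespace, keeping the word boundaries already in the text."""
    return text.split()


# Hypothetical sample: a sentence already segmented into words with spaces.
sample = "吾輩 は 猫 で ある"

print(tokenize_by_character(sample))
# ['吾', '輩', 'は', '猫', 'で', 'あ', 'る']
print(tokenize_pretokenized(sample))
# ['吾輩', 'は', '猫', 'で', 'ある']
```

Character-level splitting breaks multi-character words such as 吾輩 apart, so the word boundaries produced by a morphological analyzer are lost. Splitting on whitespace preserves them.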

APIs:

Keyword-in-Context:

Word segmentation, part-of-speech tagging, and more:

Geographic services:

  • GeoNLP: geographic name services and software from NII
  • GeoNames: downloadable and web-searchable gazetteer (English place names only)

Corpora

Literature:

Modern usage:

Bilingual:

NINJAL corpora:

See the Center for Corpus Development at NINJAL for links to many corpora and databases. Here are some notable corpora.

Dictionaries

Japanese WordNet

A WordNet has been created for Japanese; it contains synsets (synonym sets) only. Please visit the page to download the sqlite3 database of Japanese WordNet, then query it from your own code through one of the APIs available in a variety of programming languages. Here is a link to the Python API for Japanese WordNet.
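
Because the download is an sqlite3 database, it can also be queried directly with Python's built-in sqlite3 module. The sketch below is not the linked Python API. It assumes the database has a `word` table (`wordid`, `lang`, `lemma`) and a `sense` table (`synset`, `wordid`, `lang`); check the schema of the file you actually downloaded before relying on these names. So that it runs on its own, the example builds a tiny in-memory database with made-up rows instead of opening the real file.

```python
import sqlite3


def synonyms(conn, lemma, lang="jpn"):
    """Return lemmas in `lang` that share at least one synset with `lemma`."""
    rows = conn.execute(
        """
        SELECT DISTINCT w2.lemma
        FROM word AS w1
        JOIN sense AS s1 ON s1.wordid = w1.wordid
        JOIN sense AS s2 ON s2.synset = s1.synset
        JOIN word AS w2 ON w2.wordid = s2.wordid
        WHERE w1.lemma = ? AND w2.lang = ? AND w2.lemma <> ?
        ORDER BY w2.lemma
        """,
        (lemma, lang, lemma),
    ).fetchall()
    return [row[0] for row in rows]


def build_demo_db():
    """Create a stand-in database using the ASSUMED table layout."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE word (wordid INTEGER PRIMARY KEY, lang TEXT, lemma TEXT, pos TEXT);
        CREATE TABLE sense (synset TEXT, wordid INTEGER, lang TEXT);
        """
    )
    # Made-up rows for illustration; the synset IDs are not real WordNet IDs.
    conn.executemany(
        "INSERT INTO word VALUES (?, ?, ?, ?)",
        [(1, "jpn", "犬", "n"), (2, "jpn", "イヌ", "n"),
         (3, "eng", "dog", "n"), (4, "jpn", "猫", "n")],
    )
    conn.executemany(
        "INSERT INTO sense VALUES (?, ?, ?)",
        [("demo-0001-n", 1, "jpn"), ("demo-0001-n", 2, "jpn"),
         ("demo-0001-n", 3, "eng"), ("demo-0002-n", 4, "jpn")],
    )
    return conn


conn = build_demo_db()
print(synonyms(conn, "犬"))         # ['イヌ']
print(synonyms(conn, "犬", "eng"))  # ['dog']
```

To try this against the real data, replace `build_demo_db()` with `sqlite3.connect(...)` pointing at the path of the downloaded database file, and adjust the table and column names if they differ.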

Corpora ready for MALLET

These are plain-text files formatted for use with the topic modeling software MALLET. Each file gives the title of the work, the author, and the year, followed by the text with words separated by spaces. I have provided both lemmatized and raw text when available.
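
As a rough illustration of reading such a file in Python, the sketch below parses one document per line and counts its words. The exact delimiter between the metadata fields is not stated here, so the tab-separated layout (title, author, year, text) is an assumption to adjust against the actual files. The `Work` type, the `parse_line` function, and the sample line are all hypothetical.

```python
from collections import Counter
from typing import NamedTuple


class Work(NamedTuple):
    title: str
    author: str
    year: str
    tokens: list


def parse_line(line):
    """Parse one 'title<TAB>author<TAB>year<TAB>text' line (ASSUMED layout)."""
    title, author, year, text = line.rstrip("\n").split("\t", 3)
    # The text is already segmented, so whitespace splitting recovers the words.
    return Work(title, author, year, text.split())


# Made-up sample line, not taken from any of the provided corpora.
sample_line = "Sample Title\tSample Author\t1900\t猫 は 猫 で ある\n"

work = parse_line(sample_line)
print(work.title, work.author, work.year)  # Sample Title Sample Author 1900
print(Counter(work.tokens).most_common(1))  # [('猫', 2)]
```

Since the words are already separated by spaces, no morphological analyzer is needed at this stage; a plain whitespace split is enough to get tokens for counting or for passing to other tools.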