Japanese Text Analysis: Overview

A guide to tools, corpora, and other resources related to Japanese text analysis and natural language processing, with a focus on the digital humanities.

Introduction

A growing number of tools and resources are available for text analysis and natural language processing in Japanese, including part-of-speech taggers and tokenizers. Some are standalone pieces of software, while others can be used through APIs or web interfaces. The resources also include historical and contemporary dictionaries, as well as corpora of literary texts, contemporary and historical usage, and bilingual sources.

Help and Resources

Stripping Aozora HTML files of ruby and tags

This Python script strips ruby and all XHTML tags from Aozora Bunko XHTML files, producing a plain-text file encoded in UTF-8. It then segments the text using Tiny Segmenter and outputs a file with whitespace-delimited words. It also removes the Aozora metadata from the end of the file by looking for the string 底本. Note that if this string appears in the body of a text, everything after its first occurrence will be stripped as well.

The script processes every .html file in its own directory, so put it in the same directory as your files. For each input file it writes the plain-text UTF-8 output as (originalfilename).txt, with each paragraph on a single line.

The script requires Beautiful Soup 4 and Tiny Segmenter, so please install both first, following their documentation. Then run the script by typing "python derubytokenize.py" in its directory.

The script was written by Molly Des Jardin; it is CC0, so please use, alter, and redistribute it freely.
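The ruby-stripping and metadata-trimming steps described above can be sketched as follows. This is not the derubytokenize.py script itself: it is a minimal illustration that uses only the Python standard library (html.parser) in place of Beautiful Soup, and it leaves out the Tiny Segmenter step. The names RubyStripper, strip_aozora, and convert_directory are hypothetical, as are the assumptions that ruby readings sit in <rt> and <rp> elements, that <br /> marks the end of a paragraph, and that the input files are encoded in Shift_JIS.

```python
from html.parser import HTMLParser
from pathlib import Path


class RubyStripper(HTMLParser):
    """Collect visible text, skipping ruby readings and non-body content."""

    # Assumption: readings are in <rt>, fallback parentheses in <rp>.
    SKIP = {"rt", "rp", "head", "script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1
        elif tag == "br":
            # Assumption: <br /> ends a paragraph in the source file.
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def strip_aozora(html, marker="底本"):
    """Return plain text without ruby or tags, cut at the first marker."""
    parser = RubyStripper()
    parser.feed(html)
    parser.close()
    text = "".join(parser.parts)
    cut = text.find(marker)
    if cut != -1:
        # Everything from the marker onward is treated as metadata.
        text = text[:cut]
    lines = (line.strip() for line in text.splitlines())
    return "\n".join(line for line in lines if line)


def convert_directory(directory=".", encoding="shift_jis"):
    """Write a UTF-8 .txt file next to every .html file in directory."""
    written = []
    for path in sorted(Path(directory).glob("*.html")):
        html = path.read_text(encoding=encoding, errors="replace")
        out = path.with_suffix(".txt")
        out.write_text(strip_aozora(html), encoding="utf-8")
        written.append(out)
    return written
```

Calling strip_aozora on a fragment such as `<ruby><rb>吾輩</rb><rp>（</rp><rt>わがはい</rt><rp>）</rp></ruby>は猫である。` keeps the base text 吾輩 and drops the reading わがはい. To get whitespace-delimited words, the returned text would still need to be passed through a segmenter such as Tiny Segmenter.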

Morphological Analysis

Software:

  • Voyant Tools now works with Japanese (including tokenizing your text)

APIs:

Keyword-in-Context:

Word segmentation, part-of-speech tagging, and more:

Geographic services:

  • GeoNLP geographic name services and software from NII
  • GeoNames (English place names only) downloadable and web-searchable gazetteer

Corpora

Literature:

Modern usage:

Bilingual:

NINJAL corpora:

See the Center for Corpus Development at NINJAL for links to many corpora and databases. Here are some notable corpora.

Subject Guide

Molly Des Jardin
Contact:
Van Pelt 527
215-898-3205

Japanese WordNet

A WordNet has been created for Japanese; it contains synsets (synonym sets) only. Please visit the page to download the sqlite3 database of Japanese WordNet, then use one of the APIs, which are available in a variety of programming languages, to work with it in your own code. Here is a link to the Python API for Japanese WordNet.
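Because the download is an ordinary sqlite3 database, it can also be queried directly with Python's built-in sqlite3 module. The sketch below is only an illustration, not the linked Python API. It assumes the database has a word table (wordid, lang, lemma) and a sense table (synset, wordid, lang), with Japanese entries marked by the language code "jpn"; check the actual schema of the file you download before relying on these names. The function name synonyms is hypothetical.

```python
import sqlite3

# Assumed schema: word(wordid, lang, lemma, ...) and sense(synset, wordid, ...).
SYNONYM_SQL = """
SELECT DISTINCT w2.lemma
FROM word AS w1
JOIN sense AS s1 ON s1.wordid = w1.wordid
JOIN sense AS s2 ON s2.synset = s1.synset
JOIN word AS w2 ON w2.wordid = s2.wordid
WHERE w1.lemma = ?
  AND w1.lang = ?
  AND w2.lang = ?
  AND w2.wordid != w1.wordid
ORDER BY w2.lemma
"""


def synonyms(conn, lemma, lang="jpn"):
    """Return other lemmas that share at least one synset with lemma."""
    rows = conn.execute(SYNONYM_SQL, (lemma, lang, lang))
    return [row[0] for row in rows]
```

To use it, open the downloaded file with sqlite3.connect (passing whatever path you saved it to) and call synonyms(conn, "犬"). The query joins each word to its synsets and back to the other words in those synsets, which is all a synset-only WordNet can offer.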

Corpora ready for MALLET

These are plain-text files formatted for use with the topic modeling software MALLET. They contain the title of the work, author, and year, followed by the text with words separated by spaces. I have provided both lemmatized and raw format text when available.