A growing number of tools and resources are available for text analysis and natural language processing in Japanese, including part-of-speech taggers and tokenizers. Some are standalone software, while others can be used through APIs or even web interfaces. Resources also include historical and contemporary dictionaries, as well as corpora of texts from literature, contemporary and historical usage, and bilingual sources.
This Python script strips Aozora Bunko XHTML files of ruby as well as all XHTML tags, creating a plain-text UTF-8 encoded file. It then segments the text using TinySegmenter and outputs a file of whitespace-delimited words. It also removes the Aozora metadata from the end of the file by searching for the string 底本 (so if that string appears in the body of your text, everything after it will be stripped off as well).
It runs on all .html files in its own directory, so put the script in the same directory as your files. The script outputs the plain-text UTF-8 file as (originalfilename).txt, with each paragraph on a single line (a rough sketch of the approach follows below).
This was written by Molly Des Jardin; it's CC0, so please use, alter, and redistribute it freely.
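If the original script is not at hand, here is a minimal sketch of the approach described above, not the script itself. It assumes the beautifulsoup4 and tinysegmenter packages are installed and that the input files are Shift_JIS-encoded Aozora XHTML; none of these choices come from the original.

```python
# Minimal sketch, not the original script. Assumes:
#   pip install beautifulsoup4 tinysegmenter
# and Shift_JIS-encoded Aozora Bunko XHTML input files.
import glob
import os

from bs4 import BeautifulSoup
import tinysegmenter

segmenter = tinysegmenter.TinySegmenter()

for path in glob.glob('*.html'):
    with open(path, encoding='shift_jis', errors='replace') as f:
        soup = BeautifulSoup(f.read(), 'html.parser')

    # Strip ruby: drop the glosses (<rt>) and their fallback
    # parentheses (<rp>), keeping only the base text.
    for tag in soup.find_all(['rt', 'rp']):
        tag.decompose()

    # Treat <br /> as a paragraph break so each paragraph becomes one line.
    for br in soup.find_all('br'):
        br.replace_with('\n')

    text = soup.get_text()

    # Cut the trailing Aozora metadata, which begins with 底本.
    # (As noted above, this also cuts anything after 底本 in the body.)
    text = text.split('底本')[0]

    # Segment each paragraph and join the words with spaces.
    lines = [' '.join(segmenter.tokenize(p.strip()))
             for p in text.splitlines() if p.strip()]

    # Write (originalfilename).txt as UTF-8.
    with open(os.path.splitext(path)[0] + '.txt', 'w', encoding='utf-8') as f:
        f.write('\n'.join(lines))
```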
Word segmentation, part-of-speech tagging, and more:
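As a taste of what these tools do, here is a minimal part-of-speech tagging sketch. It uses fugashi (a MeCab wrapper) with the unidic-lite dictionary; neither is named in this guide, so treat both as example choices rather than recommendations.

```python
# Minimal POS-tagging sketch. Assumes:
#   pip install fugashi unidic-lite
# fugashi is one example of a MeCab wrapper, not prescribed here.
import fugashi

tagger = fugashi.Tagger()

for word in tagger('吾輩は猫である。'):
    # word.surface is the token; word.feature.pos1 is its coarse POS tag.
    print(word.surface, word.feature.pos1)
```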
Check out the Center for Corpus Development at NINJAL for links to many corpora and databases. Here are some notable corpora and related resources.
A WordNet has been created for Japanese, though it currently includes synsets (synonym sets) only. Please visit the page to download the sqlite3 database of Japanese WordNet, then use one of the APIs, available in a variety of programming languages, to work with it in your own code. Here is a link to the Python API for Japanese WordNet.
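If you would rather query the database directly than go through one of the APIs, a few lines of Python's built-in sqlite3 module suffice. This sketch assumes the downloaded file is named wnjpn.db and follows the published schema with word and sense tables; verify both against the release you download.

```python
# Sketch of querying the Japanese WordNet database directly.
# Assumes the file is wnjpn.db with the published word/sense tables.
import sqlite3

conn = sqlite3.connect('wnjpn.db')

def synonyms(lemma):
    """Return Japanese lemmas that share a synset with the given lemma."""
    rows = conn.execute(
        """
        SELECT DISTINCT w2.lemma
        FROM word AS w1
        JOIN sense AS s1 ON s1.wordid = w1.wordid
        JOIN sense AS s2 ON s2.synset = s1.synset
        JOIN word AS w2 ON w2.wordid = s2.wordid
        WHERE w1.lemma = ? AND w2.lang = 'jpn' AND w2.lemma <> w1.lemma
        """,
        (lemma,),
    ).fetchall()
    return [r[0] for r in rows]

# Prints whatever lemmas share a synset with 犬 in your copy of the database.
print(synonyms('犬'))
```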
These are plain-text files formatted for use with the topic-modeling software MALLET. Each contains the title of the work, the author, and the year, followed by the text with words separated by spaces. I have provided both lemmatized and raw text when available.
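To make that layout concrete, here is a hypothetical sketch of writing one such file. The exact arrangement of the metadata header (a single tab-separated line here) is an assumption, not a documented spec, so check an actual file from the collection before relying on it.

```python
# Hypothetical sketch of one MALLET-ready file in the format described
# above. The tab-separated header line is an assumed layout.
def write_mallet_file(out_path, title, author, year, segmented_text):
    """segmented_text: the full text, words already separated by spaces."""
    with open(out_path, 'w', encoding='utf-8') as f:
        f.write(f'{title}\t{author}\t{year}\n')
        f.write(segmented_text + '\n')

# Illustrative call; the segmented sentence is made up for the example.
write_mallet_file('kokoro.txt', 'こころ', '夏目漱石', 1914,
                  'こころ は 夏目 漱石 の 小説 で ある')
```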