on the Curl Web Content Markup Language

on the Curl Web Content Markup and Programming Language from www.curl.com and www.curlap.com

Wednesday, April 25, 2012

Kanjidic2 as CSV in HTML and text


To help a neutral party assess approaches to digital dictionaries for Japanese, I have posted an HTML file displaying more than 10,000 entries from the start of Kanjidic2 at
  http://kanji.aule-browser.com/kanjidic2-m12.html
I have restricted the dump to the kanji, its UCS code and at most 12 of a possible 14 meanings.

There are fewer than 10,200 entries because, of the first 12,155 entries, many had no XML meaning element without a language attribute (in Kanjidic2, a meaning with no language attribute is English), and those entries were skipped. The skipped entries may have English translations in markup previously used for foreign languages.

The same data can be found as a CSV file at
  http://kanji.aule-browser.com/kanjidic2-m12.csv
with a three-line header, which you may have to alter for your purposes.

The Kanjidic2 XML file was parsed using the Curl XDM library from curl.com (in Japanese: http://www.curlap.com).
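
For readers without Curl, the extraction described above can be sketched in Python. This is not the Curl XDM code used for the posted files; it is a minimal illustration using the standard-library `xml.etree.ElementTree`, and it assumes the usual Kanjidic2 layout (`character`, `literal`, `cp_value` with `cp_type="ucs"`, and `meaning` elements that carry `m_lang` only for non-English languages). The function name `extract_rows` and the sample data are hypothetical.

```python
import csv
import io
import xml.etree.ElementTree as ET

MAX_MEANINGS = 12  # cap on meanings per entry, as described in the post

# Abridged, illustrative sample in the Kanjidic2 layout (not the real file).
# The second entry is invented: it has only a non-English meaning.
SAMPLE = """<kanjidic2>
  <character>
    <literal>\u4e9c</literal>
    <codepoint>
      <cp_value cp_type="ucs">4e9c</cp_value>
      <cp_value cp_type="jis208">16-01</cp_value>
    </codepoint>
    <reading_meaning><rmgroup>
      <meaning>Asia</meaning>
      <meaning>rank next</meaning>
      <meaning m_lang="fr">Asie</meaning>
    </rmgroup></reading_meaning>
  </character>
  <character>
    <literal>\u5516</literal>
    <codepoint><cp_value cp_type="ucs">5516</cp_value></codepoint>
    <reading_meaning><rmgroup>
      <meaning m_lang="fr">exemple</meaning>
    </rmgroup></reading_meaning>
  </character>
</kanjidic2>"""


def extract_rows(xml_text, max_meanings=MAX_MEANINGS):
    """Return [kanji, ucs, meaning, ...] rows for entries with English meanings."""
    root = ET.fromstring(xml_text)
    rows = []
    for ch in root.iter("character"):
        literal = ch.findtext("literal")
        ucs = ch.findtext("codepoint/cp_value[@cp_type='ucs']")
        # Assumption: English meanings are those without an m_lang attribute.
        meanings = [m.text for m in ch.iter("meaning")
                    if "m_lang" not in m.attrib and m.text]
        if not meanings:
            continue  # skip entries with no English meaning
        rows.append([literal, ucs] + meanings[:max_meanings])
    return rows


def rows_to_csv(rows):
    """Serialize the rows as CSV text (no header lines)."""
    buf = io.StringIO()
    csv.writer(buf).writerows(rows)
    return buf.getvalue()


if __name__ == "__main__":
    print(rows_to_csv(extract_rows(SAMPLE)))
```

Skipping entries that have no attribute-free meaning is what makes the output shorter than the number of entries read, which matches the gap between 12,155 and the roughly 10,000 rows posted.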

As it stands, the HTML file should be useful for building custom Anki flashcards (themselves stored as SQLite). I will be using a variant of the CSV output to construct dictionary software with annotations and spaced-repetition options. Curl has both CSV and SQLite libraries in addition to the XML libraries.
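
As a rough idea of the CSV-to-SQLite step, here is a minimal Python sketch using only the standard `csv` and `sqlite3` modules. It is not Curl code and not Anki's own schema. The table name `kanji`, the column layout (kanji, UCS code, then meanings) and the function name `load_csv` are assumptions made for illustration; the only detail taken from the post is the three-line header.

```python
import csv
import sqlite3

HEADER_LINES = 3  # the posted CSV starts with a three-line header


def load_csv(conn, csv_text, header_lines=HEADER_LINES):
    """Load kanji rows into a SQLite table, skipping the header lines.

    Assumed column layout: kanji, UCS code, then one column per meaning.
    Returns the number of rows written.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS kanji ("
        "literal TEXT PRIMARY KEY, ucs TEXT, meanings TEXT)"
    )
    data_lines = csv_text.splitlines()[header_lines:]
    count = 0
    for row in csv.reader(data_lines):
        if len(row) < 3:
            continue  # blank or incomplete line
        literal, ucs, *meanings = row
        conn.execute(
            "INSERT OR REPLACE INTO kanji VALUES (?, ?, ?)",
            (literal, ucs, "; ".join(meanings)),
        )
        count += 1
    conn.commit()
    return count


# Hypothetical sample: three placeholder header lines and one data row.
SAMPLE_CSV = "header 1\nheader 2\nheader 3\n\u4e9c,4e9c,Asia,rank next\n"

if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    load_csv(conn, SAMPLE_CSV)
    print(conn.execute("SELECT literal, ucs, meanings FROM kanji").fetchall())
```

Joining the meanings into one text column keeps the table simple; a separate meanings table would be the better choice if individual meanings need their own annotations.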


