POD2::JA::Unicode::LineBreak - UAX #14 Unicode Line Breaking Algorithm


NAME

Unicode::LineBreak~[ja] - UAX #14 Unicode Line Breaking Algorithm

SYNOPSIS

    use Unicode::LineBreak;
    $lb = Unicode::LineBreak->new();
    $broken = $lb->break($string);

DESCRIPTION

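Unicode::LineBreak performs the Unicode Line Breaking Algorithm described in Unicode Standard Annex #14 [UAX #14]. When it determines break positions, it also takes into account the East_Asian_Width informative property defined in Annex #11 [UAX #11].

Terminology

The following terms are used for convenience.

A mandatory break is a line breaking behavior that is defined by the core rules and is always performed, regardless of the surrounding characters. An arbitrary break is a line breaking behavior that is allowed by the core rules and is performed only when the user chooses to perform it. The arbitrary breaks defined by [UAX #14] comprise direct breaks and indirect breaks.

Alphabetic characters are characters between which a line normally cannot be broken unless another character provides a break opportunity. Ideographic characters are characters that normally allow a line break both before and after themselves. [UAX #14] classifies most alphabetic characters as AL and most ideographic characters as ID (these terms are inaccurate from the viewpoint of grammatology). In several scripts, break positions cannot be determined from the individual characters, so a dictionary-based heuristic is used.

The number of columns of a string is not always equal to the number of characters it contains. Each character is either wide, narrow or nonspacing, and occupies 2, 1 or 0 columns respectively. Several characters may be either wide or narrow depending on the context in which they are used. With customization, characters may have a greater variety of widths.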

PUBLIC INTERFACE

Line Breaking

new ([KEY => VALUE, ...])

Constructor. For the KEY => VALUE pairs, see "Options".

break (STRING)

Instance method. Breaks the Unicode string STRING and returns the result. In array context, returns an array of the lines contained in the result.

break_partial (STRING)

Instance method. Same as break(), but accepts input that is fed incrementally, a piece at a time. To indicate that the input is complete, pass undef as the STRING argument.

config (KEY)
config (KEY => VALUE, ...)

Instance method. Gets or updates the configuration. For the KEY => VALUE pairs, see "Options".

copy

Copy constructor. Creates a copy of the object instance.
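
The following is a minimal sketch of how the methods above fit together. It assumes that Unicode::LineBreak is installed; the sample text and the variable names are illustrative only, and no particular output is claimed.

```perl
use strict;
use warnings;
use Unicode::LineBreak;

my $text = "The quick brown fox jumps over the lazy dog. " x 3 . "\n";

# Constructor with a single option; all other options keep their defaults.
my $lb = Unicode::LineBreak->new(ColMax => 30);

my $wrapped = $lb->break($text);    # scalar context: the broken text
my @lines   = $lb->break($text);    # array context: one element per line

# copy() clones the object, so the clone can be reconfigured separately.
my $narrow = $lb->copy;
$narrow->config(ColMax => 15);
printf "ColMax: %d and %d\n", $lb->config('ColMax'), $narrow->config('ColMax');

# break_partial() takes the input piece by piece; undef marks the end of input.
my $partial = '';
for my $piece ("Pack my box ", "with five dozen ", "liquor jugs.\n") {
    $partial .= $narrow->break_partial($piece) // '';
}
$partial .= $narrow->break_partial(undef) // '';
print $partial;
```

The `// ''` guards are a precaution for calls that have nothing to return yet; they are an assumption of this sketch rather than documented behavior.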

Getting Information

breakingRule (BEFORESTR, AFTERSTR)

Instance method. Returns the line breaking behavior between the strings BEFORESTR and AFTERSTR. For the returned value, see "Constants".

Note: This method returns only an approximate description of the line breaking behavior. To wrap actual text, use break() or a similar method.

context ([Charset => CHARSET], [Language => LANGUAGE])

Function. Returns the language/region context used by the character set CHARSET or the language code LANGUAGE.
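
As a hedged sketch, the two calls above might be used as follows. It assumes that the ":all" export tag provides the behavior constants described under "Constants" and that those constants compare numerically; the character pairs are arbitrary samples and their results are not asserted here.

```perl
use strict;
use warnings;
use Unicode::LineBreak qw(:all);

# Map the four behavior constants back to readable names.
my %name = (
    MANDATORY()  => 'MANDATORY',
    DIRECT()     => 'DIRECT',
    INDIRECT()   => 'INDIRECT',
    PROHIBITED() => 'PROHIBITED',
);

my $lb = Unicode::LineBreak->new();
for my $pair (['a', 'b'], ["\x{6F22}", "\x{5B57}"]) {
    my $rule = $lb->breakingRule(@$pair);
    printf "U+%04X / U+%04X: %s\n",
        ord $pair->[0], ord $pair->[1], $name{$rule} // 'unknown';
}

# context() is a plain function; its result can be passed to the Context option.
my $ctx = Unicode::LineBreak::context(Language => 'ja');
my $lb_ctx = Unicode::LineBreak->new(Context => $ctx);
print "context for 'ja': $ctx\n";
```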

Options

The "new" and "config" methods accept the following pairs. Each of them affects the calculation of the number of columns ([E]), grapheme cluster segmentation ([G]) (see Unicode::GCString~[ja]), or line breaking behavior ([L]).

BreakIndent => "YES" | "NO"

[L] Always allow a break after a run of SPACEs at the beginning of a line (an indent). [UAX #14] does not take this use of SPACE into account. Default is "YES".

Note: This option was introduced in release 1.011.

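CharMax => NUMBER

[L] Maximum number of characters a line may contain, not counting trailing whitespace and the newline sequence. Note that the number of characters does not, in general, represent the length of a line. Default is 998. It cannot be 0.

ColMin => NUMBER

[L] Minimum number of columns that a line broken by an arbitrary break may contain, not counting the newline sequence and trailing whitespace. Default is 0.

ColMax => NUMBER

[L] Maximum number of columns a line may contain, not counting the newline sequence and trailing whitespace; in other words, the maximum length of a line. Default is 76.

See also the "Urgent" option and "User-Defined Breaking Behaviors".
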
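ComplexBreaking => "YES" | "NO"

[L] Perform heuristic line folding in South East Asian complex contexts. Default is "YES" if word segmentation for South East Asian writing systems is enabled.

Context => CONTEXT

[E][L] Specify the language/region context. The contexts currently available are "EASTASIAN" and "NONEASTASIAN". The default context is "NONEASTASIAN".

In the "EASTASIAN" context, characters whose East_Asian_Width property is ambiguous (A) are treated as wide, and characters whose line breaking class is AI are treated as ideographic (ID).

In the "NONEASTASIAN" context, characters whose East_Asian_Width property is ambiguous (A) are treated as narrow, and characters whose line breaking class is AI are treated as alphabetic (AL).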

EAWidth => [ ORD => PROPERTY ]
EAWidth => undef

[E] Tailor the East_Asian_Width property of individual characters. ORD is the UCS index value of a character, or a reference to an array of such values. PROPERTY is one of the East_Asian_Width property values or extended values (see "Constants"). This option may be specified multiple times. Specifying undef cancels all tailoring assigned so far.

By default, the East_Asian_Width property is not tailored. See also "Tailoring Character Properties".

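Format => METHOD

[L] Specify the method used to format broken lines.

"SIMPLE"

Default method. Only inserts a newline at each arbitrary break position.

"NEWLINE"

Replaces newline sequences with the one specified by the "Newline" option, removes whitespace preceding a newline or the end of text, and appends a newline at the end of text if there is none.

"TRIM"

Inserts a newline at each arbitrary break position and removes whitespace preceding newlines.

undef

Does nothing, not even inserting newlines.

Subroutine reference

See "Formatting Lines".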

HangulAsAL => "YES" | "NO"

[L] Treat hangul syllables and conjoining jamos as alphabetic characters (AL). Default is "NO".

LBClass => [ ORD => CLASS ]
LBClass => undef

[G][L] Tailor the line breaking property (class) of individual characters. ORD is the UCS index value of a character, or a reference to an array of such values. CLASS is one of the line breaking property values (see "Constants"). This option may be specified multiple times. Specifying undef cancels all tailoring assigned so far.

By default, the line breaking property is not tailored. See also "Tailoring Character Properties".

LegacyCM => "YES" | "NO"

[G][L] Treat a combining character preceded by a SPACE as an isolated combining character (ID). As of Unicode 5.0, this use of SPACE is no longer recommended. Default is "YES".

Newline => STRING

[L] Unicode string to be used as the newline sequence. Default is "\n".

Prep => METHOD

[L] Add a user-defined line breaking behavior. This option may be specified multiple times. METHOD may be one of the following.

"NONBREAKURI"

Do not break URIs.

"BREAKURI"

Break URIs according to a rule suitable for printed materials. For details see [CMOS], sections 6.17 and 17.11.

[ REGEX, SUBREF ]

Strings matching the regular expression REGEX are broken by the subroutine referred to by SUBREF. For details see "User-Defined Breaking Behaviors".

undef

Cancel all behaviors added so far.

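Sizing => METHOD

[L] Specify the method used to calculate the size of a string. The following options are available.

"UAX11"

Default method. The number of columns of each character is calculated from the built-in character database.

undef

Returns the number of grapheme clusters (see Unicode::GCString) contained in the string.

Subroutine reference

See "Calculating String Size".

See also the "ColMax", "ColMin" and "EAWidth" options.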

Urgent => METHOD

[L] Specify how to handle lines that are too long. The following options are available.

"CROAK"

Print an error message and die.

"FORCE"

Forcibly break the overlong string.

undef

Default method. Do not break the overlong string.

Subroutine reference

See "User-Defined Breaking Behaviors".

ViramaAsJoiner => "YES" | "NO"

[G] Do not separate a virama sign ("halant" in Hindi, "coeng" in Khmer script) from the letter that follows it. Default is "YES". Note: This option was introduced in release 2011.001_29. In earlier releases the behavior was fixed to "NO". This feature is not part of the "default" grapheme cluster defined by [UAX #29].
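
As an illustration of how several of these options combine, here is a small sketch. It assumes the module is installed; the option values and the sample text are arbitrary choices for demonstration.

```perl
use strict;
use warnings;
use Unicode::LineBreak;

my $lb = Unicode::LineBreak->new(
    ColMax  => 40,             # wrap at 40 columns instead of the default 76
    Format  => 'NEWLINE',      # normalize newlines and trim trailing whitespace
    Newline => "\r\n",         # newline sequence to emit
    Urgent  => 'FORCE',        # forcibly break runs that cannot fit otherwise
    Prep    => 'NONBREAKURI',  # but never break inside a URI
);

my $text = "See http://unicode.org/reports/tr14/ for the rules; "
         . ('x' x 60) . " is an overlong word.";
my $out = $lb->break($text);
print $out;
```

With "Format" set to "NEWLINE" the result should end with the configured newline sequence even though the input has none, as described under that option.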

Constants

"EA_Na", "EA_N", "EA_A", "EA_W", "EA_H", "EA_F"

The six East_Asian_Width property values defined by [UAX #11]: narrow (Na), neutral (N), ambiguous (A), wide (W), halfwidth (H) and fullwidth (F).

"EA_Z"

East_Asian_Width property value for nonspacing characters.

Note: This "nonspacing" value is an extension by this module and is not part of [UAX #11].

"LB_BK", "LB_CR", "LB_LF", "LB_NL", "LB_SP", "LB_OP", "LB_CL", "LB_CP",
"LB_QU", "LB_GL", "LB_NS", "LB_EX", "LB_SY", "LB_IS", "LB_PR", "LB_PO",
"LB_NU", "LB_AL", "LB_HL", "LB_ID", "LB_IN", "LB_HY", "LB_BA", "LB_BB",
"LB_B2", "LB_CB", "LB_ZW", "LB_CM", "LB_WJ", "LB_H2", "LB_H3", "LB_JL",
"LB_JV", "LB_JT", "LB_SG", "LB_AI", "LB_CJ", "LB_SA", "LB_XX", "LB_RI"

The 40 line breaking property values (classes) defined by [UAX #14].

Note: Property value CP was introduced in Unicode 5.2.0. Property values HL and CJ were introduced in Unicode 6.1.0. Property value RI was introduced in Unicode 6.2.0.

"MANDATORY", "DIRECT", "INDIRECT", "PROHIBITED"

Four values that represent line breaking behaviors: mandatory break; both direct and indirect breaks are allowed; indirect break is allowed but direct break is prohibited; break is prohibited.

"Unicode::LineBreak::SouthEastAsian::supported"

Flag indicating whether word segmentation for South East Asian writing systems is enabled. If the feature is enabled, it is a non-empty string; otherwise it is undef.

Note: The current release supports only the Thai script of the modern Thai language.

"UNICODE_VERSION"

A string indicating the version of the Unicode Standard that this module refers to.

CUSTOMIZATION

Formatting Lines

If a subroutine reference is specified as the value of the "Format" option, the subroutine must accept three arguments:

    $MODIFIED = &subroutine(SELF, EVENT, STR);

SELF is a Unicode::LineBreak object, EVENT is a string indicating the context in which the subroutine was called, and STR is the fragment of the Unicode string before or after the break position.

    EVENT | When fired             | Value of STR
    -----------------------------------------------------------------
    "sot" | Beginning of text      | Fragment of the first line
    "sop" | After mandatory break  | Fragment of the next line
    "sol" | After arbitrary break  | Fragment of the continuation line
    ""    | Just before any break  | Complete line (without trailing whitespace)
    "eol" | Arbitrary break        | Whitespace preceding the break position
    "eop" | Mandatory break        | Newline and the whitespace preceding it
    "eot" | End of text            | Whitespace (and newline) at the end of text
    -----------------------------------------------------------------

The subroutine must return the modified text fragment. To indicate that nothing was modified, it may return undef. Note that modifications made in the "sot", "sop" or "sol" contexts affect the determination of subsequent break positions, whereas modifications made in the other contexts do not.

Note: The string arguments are actually sequences of grapheme clusters. See Unicode::GCString~[ja].

For example, the following code folds lines while removing trailing whitespace:

    sub fmt {
        # Any end-of-line event ("eol", "eop", "eot"): replace STR by a newline.
        if ($_[1] =~ /^eo/) {
            return "\n";
        }
        return undef;
    }
    my $lb = Unicode::LineBreak->new(Format => \&fmt);
    $output = $lb->break($text);
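
As a further sketch of the event table above, the hypothetical callback below ends every line with a newline and indents continuation lines. The name hanging_indent and the four-space indent are assumptions made for illustration. Because the callback only inspects its arguments, it can be exercised on its own with plain strings.

```perl
use strict;
use warnings;

# Hypothetical Format callback: hanging indent for continuation lines.
sub hanging_indent {
    my ($self, $event, $str) = @_;
    return '    ' . $str if $event eq 'sol';   # fragment after an arbitrary break
    return "\n"          if $event =~ /^eo/;   # "eol", "eop" and "eot"
    return undef;                              # leave all other fragments alone
}

# Direct calls with plain strings, no Unicode::LineBreak object needed:
print join('|', map { defined $_ ? $_ : 'undef' }
    scalar hanging_indent(undef, 'sol', 'rest'),
    scalar hanging_indent(undef, '',    'a line')), "\n";
# prints "    rest|undef"
```

To put it to use, pass it as Format => \&hanging_indent. Since the indent is added in the "sol" context, it is expected to count toward the width of the continuation line, as noted above.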

User-Defined Breaking Behaviors

When a line produced by arbitrary breaks is expected to exceed any of the limits CharMax, ColMax or ColMin, an urgent break may be performed on the string that follows. If a subroutine reference is specified as the value of the "Urgent" option, the subroutine must accept two arguments:

    @BROKEN = &subroutine(SELF, STR);

SELF is a Unicode::LineBreak object and STR is the Unicode string to be broken.

The subroutine must return an array containing the pieces of the broken string STR.

Note: The string argument is actually a sequence of grapheme clusters. See Unicode::GCString~[ja].

For example, the following code inserts hyphens into the names of certain chemical substances (such as titin) so that they can be folded:

    sub hyphenize {
        # Split after each "...yl" followed by more letters, then append a hyphen.
        return map {$_ =~ s/yl$/yl-/; $_} split /(\w+?yl(?=\w))/, $_[1];
    }
    my $lb = Unicode::LineBreak->new(Urgent => \&hyphenize);
    $output = $lb->break("Methionylthreonylthreonylglutaminylarginyl...");

If an array reference [REGEX, SUBREF] is specified for the "Prep" option, the subroutine must accept two arguments:

    @BROKEN = &subroutine(SELF, STR);

SELF is a Unicode::LineBreak object and STR is the Unicode string to be broken, which matched REGEX.

The subroutine must return an array containing the pieces of the broken string STR.

For example, the following code breaks HTTP URLs according to the rule in [CMOS]:

    my $url = qr{http://[\x21-\x7E]+}i;
    sub breakurl {
        my $self = shift;
        my $str = shift;
        # Break after a slash, before certain punctuation, and after "=" or "&".
        return split m{(?<=[/]) (?=[^/]) |
                       (?<=[^-.]) (?=[-~.,_?\#%=&]) |
                       (?<=[=&]) (?=.)}x, $str;
    }
    my $lb = Unicode::LineBreak->new(Prep => [$url, \&breakurl]);
    $output = $lb->break($string);
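
As another sketch of the "Urgent" calling convention described above, the hypothetical callback below cuts an overlong string into pieces of at most eight grapheme clusters. The name chop_by_eight and the limit of eight are illustrative assumptions; note that it counts grapheme clusters, not columns.

```perl
use strict;
use warnings;

# Hypothetical Urgent callback: pieces of at most 8 grapheme clusters each.
sub chop_by_eight {
    my ($self, $str) = @_;
    # Stringify first, because STR may be an object rather than a plain string.
    return "$str" =~ /(\X{1,8})/g;
}

# Direct call with a plain string:
print join('|', chop_by_eight(undef, 'abcdefghijklmnopqrst')), "\n";
# prints "abcdefgh|ijklmnop|qrst"
```

It would be installed with Urgent => \&chop_by_eight. Using \X rather than a single-character pattern keeps a base character together with its combining marks.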

Preserving State

A Unicode::LineBreak object behaves as a hash reference. Arbitrary items may be stored in it for the lifetime of the object.

For example, the following code separates paragraphs with empty lines:

    sub paraformat {
        my $self = shift;
        my $action = shift;
        my $str = shift;
        if ($action eq 'sot' or $action eq 'sop') {
            # Start of a paragraph: forget the previous line.
            $self->{'line'} = '';
        } elsif ($action eq '') {
            # Remember the current line for the end-of-paragraph check.
            $self->{'line'} = $str;
        } elsif ($action eq 'eol') {
            return "\n";
        } elsif ($action eq 'eop') {
            # Add an empty line only after a non-empty paragraph.
            if (length $self->{'line'}) {
                return "\n\n";
            } else {
                return "\n";
            }
        } elsif ($action eq 'eot') {
            return "\n";
        }
        return undef;
    }
    my $lb = Unicode::LineBreak->new(Format => \&paraformat);
    $output = $lb->break($string);
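
The same idea can be sketched with a simpler callback that only counts lines. The key name "lines" is an arbitrary choice, and the sketch assumes that the "" event fires once per output line. A plain hash reference stands in for the object so that the callback can be tried without the module.

```perl
use strict;
use warnings;

# Hypothetical Format callback that keeps a line counter in the object itself.
sub counting_format {
    my ($self, $event, $str) = @_;
    $self->{lines} = 0 if $event eq 'sot';   # reset at the beginning of text
    $self->{lines}++   if $event eq '';      # one "" event per complete line
    return "\n" if $event =~ /^eo/;          # trim trailing whitespace
    return undef;
}

# Simulate the events of a two-line text with a plain hash reference.
my $state = {};
counting_format($state, $_, 'x') for 'sot', '', 'eol', 'sol', '', 'eot';
print "lines seen: $state->{lines}\n";       # prints "lines seen: 2"
```

With a real object, the counter would be read back after break() as $lb->{lines}.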

Calculating String Size

If a subroutine reference is specified as the value of the "Sizing" option, the subroutine must accept five arguments:

    $COLS = &subroutine(SELF, LEN, PRE, SPC, STR);

SELF is a Unicode::LineBreak object, LEN is the size of the preceding string, PRE is the preceding Unicode string, SPC is the additional whitespace, and STR is the Unicode string to be processed.

The subroutine must calculate and return the number of columns of PRE.SPC.STR. The number of columns need not be an integer. Its unit may be chosen freely, but it must be the same as the unit used for the "ColMin" and "ColMax" options.

Note: The string arguments are actually sequences of grapheme clusters. See Unicode::GCString~[ja].

For example, the following code processes lines on the assumption that they have tab stops every eight columns:

    sub tabbedsizing {
        my ($self, $cols, $pre, $spc, $str) = @_;
        my $spcstr = $spc.$str;
        # Consume leading whitespace, advancing to the next tab stop on TAB.
        while ($spcstr->lbc == LB_SP) {
            my $c = $spcstr->item(0);
            if ($c eq "\t") {
                $cols += 8 - $cols % 8;
            } else {
                $cols += $c->columns;
            }
            $spcstr = $spcstr->substr(1);
        }
        # Add the width of the remaining, non-whitespace part.
        $cols += $spcstr->columns;
        return $cols;
    }
    my $lb = Unicode::LineBreak->new(LBClass => [ord("\t") => LB_SP],
                                     Sizing => \&tabbedsizing);
    $output = $lb->break($string);
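
To illustrate that the returned size need not be an integer, here is a sketch with an invented width rule: Han ideographs count as 1.5 units and every other character as 1. Both the rule and the name proportional_sizing are assumptions for demonstration only.

```perl
use strict;
use warnings;

# Hypothetical Sizing callback with a non-integer unit.
sub proportional_sizing {
    my ($self, $len, $pre, $spc, $str) = @_;
    my $cols = $len;                        # LEN: size of the preceding string PRE
    for my $ch (split //, "$spc$str") {     # stringify: arguments may be objects
        $cols += $ch =~ /\p{Han}/ ? 1.5 : 1;
    }
    return $cols;                           # size of PRE.SPC.STR
}

# Direct call: 3 (LEN) + 1 (space) + 1 ("d") + 1.5 + 1.5 (two ideographs)
printf "%.1f\n", proportional_sizing(undef, 3, 'abc', ' ', "d\x{6F22}\x{5B57}");
# prints "8.0"
```

When such a callback is installed with Sizing => \&proportional_sizing, "ColMax" and "ColMin" have to be given in the same unit.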

Tailoring Character Properties

The line breaking property (class) and the East_Asian_Width property of individual characters can be tailored with the "LBClass" and "EAWidth" options. Several constants are defined for convenience when doing so.

Line Breaking Classes

Non-starters of Kana and Kana-like Characters

By default, several kana and kana-like characters are treated as non-starters (NS or CJ). If the following pairs are specified for the LBClass option, these characters are treated as ordinary ideographic characters (ID).

"KANA_NONSTARTERS() => LB_ID"

All of the characters below.

"IDEOGRAPHIC_ITERATION_MARKS() => LB_ID"

Ideographic iteration marks: U+3005 IDEOGRAPHIC ITERATION MARK, U+303B VERTICAL IDEOGRAPHIC ITERATION MARK, U+309D HIRAGANA ITERATION MARK, U+309E HIRAGANA VOICED ITERATION MARK, U+30FD KATAKANA ITERATION MARK and U+30FE KATAKANA VOICED ITERATION MARK.

Note: Some of these are not kana.

"KANA_SMALL_LETTERS() => LB_ID"
"KANA_PROLONGED_SOUND_MARKS() => LB_ID"

Small kana letters: hiragana small letters U+3041 ぁ, U+3043 ぃ, U+3045 ぅ, U+3047 ぇ, U+3049 ぉ, U+3063 っ, U+3083 ゃ, U+3085 ゅ, U+3087 ょ, U+308E ゎ, U+3095 ゕ, U+3096 ゖ; katakana small letters U+30A1 ァ, U+30A3 ィ, U+30A5 ゥ, U+30A7 ェ, U+30A9 ォ, U+30C3 ッ, U+30E3 ャ, U+30E5 ュ, U+30E7 ョ, U+30EE ヮ, U+30F5 ヵ, U+30F6 ヶ; katakana phonetic extensions U+31F0 ㇰ - U+31FF ㇿ; halfwidth katakana small letters U+FF67 ァ - U+FF6F ッ.

Prolonged sound marks: U+30FC KATAKANA-HIRAGANA PROLONGED SOUND MARK and U+FF70 HALFWIDTH KATAKANA-HIRAGANA PROLONGED SOUND MARK.

Note: These characters are sometimes treated as non-starters and sometimes as ordinary ideographic characters. See [JIS X 4051] 6.1.1, [JLREQ] 3.1.7 and [UAX #14].

Note: U+3095 ゕ, U+3096 ゖ, U+30F5 ヵ and U+30F6 ヶ are considered not to be kana.

"MASU_MARK() => LB_ID"

U+303C MASU MARK.

Note: This character is not kana, but it is usually used as an abbreviation of "ます" or "マス".

Note: This character is classified as a non-starter (NS) by [UAX #14], whereas [JIS X 4051] and [JLREQ] classify it into character class (13) and cl-19 respectively (corresponding to ID).

Ambiguous Quotation Marks

By default, several punctuation marks are treated as ambiguous quotation marks (QU).

"BACKWARD_QUOTES() => LB_OP, FORWARD_QUOTES() => LB_CL"

Some languages (Dutch, English, Italian, Portuguese, Spanish, Turkish and most East Asian languages) use rotated-9-style quotation marks (‘ “) as opening marks and 9-style quotation marks (’ ”) as closing marks.

"FORWARD_QUOTES() => LB_OP, BACKWARD_QUOTES() => LB_CL"

Other languages (Czech, German and Slovak) use 9-style quotation marks (’ ”) as opening marks and rotated-9-style quotation marks (‘ “) as closing marks.

"BACKWARD_GUILLEMETS() => LB_OP, FORWARD_GUILLEMETS() => LB_CL"

French, Greek, Russian and others use left-pointing guillemets (« ‹) as opening marks and right-pointing guillemets (» ›) as closing marks.

"FORWARD_GUILLEMETS() => LB_OP, BACKWARD_GUILLEMETS() => LB_CL"

German and Slovak use right-pointing guillemets (» ›) as opening marks and left-pointing guillemets (« ‹) as closing marks.

Danish, Finnish, Norwegian and Swedish use 9-style quotation marks or right-pointing guillemets (’ ” » ›) as both opening and closing marks.

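IDEOGRAPHIC SPACE

"IDEOGRAPHIC_SPACE() => LB_BA"

U+3000 IDEOGRAPHIC SPACE is never placed at the beginning of a line. This is the default behavior.

"IDEOGRAPHIC_SPACE() => LB_ID"

IDEOGRAPHIC SPACE may be placed at the beginning of a line. This was the default behavior up to Unicode 6.2.

"IDEOGRAPHIC_SPACE() => LB_SP"

IDEOGRAPHIC SPACE is never placed at the beginning of a line and is allowed to protrude beyond the end of a line.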

East_Asian_Width Properties

Certain letters of the Latin, Greek and Cyrillic scripts have the ambiguous (A) East_Asian_Width property, and are therefore treated as wide in the "EASTASIAN" context. By specifying "EAWidth => [ AMBIGUOUS_*() => EA_N ]", such characters are always treated as narrow.

"AMBIGUOUS_ALPHABETICS() => EA_N"

Treat all of the characters below as having the East_Asian_Width property N (neutral).

"AMBIGUOUS_CYRILLIC() => EA_N"
"AMBIGUOUS_GREEK() => EA_N"
"AMBIGUOUS_LATIN() => EA_N"

Treat letters of the Cyrillic, Greek and Latin scripts that have ambiguous (A) width as neutral (N).

On the other hand, several characters have often been rendered as wide characters by many implementations of East Asian coded character sets, yet the Unicode Standard classifies them as narrow (Na) because they have fullwidth (F) compatibility characters. By specifying the EAWidth option as follows, these characters are treated as wide in the "EASTASIAN" context.

"QUESTIONABLE_NARROW_SIGNS() => EA_A"

U+00A2 CENT SIGN, U+00A3 POUND SIGN, U+00A5 YEN SIGN (or yuan sign), U+00A6 BROKEN BAR, U+00AC NOT SIGN and U+00AF MACRON.
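
A hedged sketch of how these tailoring pairs might be passed to the constructor is given below. It assumes that the ":all" export tag provides the constants named in this section; the sample sentence is arbitrary and no particular break positions are claimed.

```perl
use strict;
use warnings;
use Unicode::LineBreak qw(:all);

my $lb = Unicode::LineBreak->new(
    Context => 'EASTASIAN',
    ColMax  => 20,
    # English-style quotation marks: backward quotes open, forward quotes close.
    LBClass => [BACKWARD_QUOTES() => LB_OP, FORWARD_QUOTES() => LB_CL],
    # Keep ambiguous-width alphabetic letters narrow even in this context.
    EAWidth => [AMBIGUOUS_ALPHABETICS() => EA_N],
);

my $out = $lb->break("He said \x{201C}hello, world\x{201D} and left the room.\n");
binmode STDOUT, ':encoding(UTF-8)';
print $out;
```

Because both options may be specified multiple times, further pairs can be added with later config() calls, and passing undef cancels the tailoring assigned so far.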

Configuration File

The built-in defaults of the option arguments of the "new" and "config" methods can be overridden by a configuration file, Unicode/LineBreak/Defaults.pm. For details, read Unicode/LineBreak/Defaults.pm.sample.

BUGS

Please report bugs or buggy behavior to the developer.

CPAN Request Tracker: <http://rt.cpan.org/Public/Dist/Display.html?Name=Unicode-LineBreak>.

VERSION

Consult the $VERSION variable.

Incompatible Changes

2012.06

The eawidth() method was deprecated. "columns" in Unicode::GCString may be used instead.

The lbclass() method was deprecated. Use "lbc" in Unicode::GCString or "lbcext" in Unicode::GCString instead.

Conformance to Standards

The character property values used by this module are based on the Unicode Standard, version 8.0.0.

This module is intended to implement the conformance level UAX14-C2.

IMPLEMENTATION NOTES

Some ideographic characters may be treated either as NS or as ID, by choice.

Hangul syllables and conjoining jamos may be treated either as ID or as AL, by choice.

Characters classified as AI may be resolved either to AL or to ID, by choice.

Characters classified as CB are not resolved.

Characters classified as CJ are always resolved to NS. A more flexible tailoring mechanism is provided.

When word segmentation for South East Asian writing systems is not supported, characters classified as SA are resolved to AL, except that characters whose Grapheme_Cluster_Break property value is Extend or SpacingMark are resolved to CM.

Characters classified as SG or XX are resolved to AL.

Code points in the following UCS ranges have fixed property values even if no characters have been assigned to them.

    Range              | UAX #14  | UAX #11    | Description
    -------------------------------------------------------------
    U+20A0..U+20CF     | PR [*1]  | N [*2]     | Currency symbols
    U+3400..U+4DBF     | ID       | W          | CJK ideographs
    U+4E00..U+9FFF     | ID       | W          | CJK ideographs
    U+D800..U+DFFF     | AL (SG)  | N          | Surrogates
    U+E000..U+F8FF     | AL (XX)  | F or N (A) | Private use
    U+F900..U+FAFF     | ID       | W          | CJK ideographs
    U+20000..U+2FFFD   | ID       | W          | CJK ideographs
    U+30000..U+3FFFD   | ID       | W          | Old hanzi
    U+F0000..U+FFFFD   | AL (XX)  | F or N (A) | Private use
    U+100000..U+10FFFD | AL (XX)  | F or N (A) | Private use
    Other unassigned   | AL (XX)  | N          | Unassigned, reserved
                       |          |            | or noncharacters
    -------------------------------------------------------------
    [*1] Except U+20A7 PESETA SIGN (PO), U+20B6 LIVRE TOURNOIS SIGN
         (PO), U+20BB NORDIC MARK SIGN (PO) and U+20BE LARI SIGN (PO).
    [*2] Except U+20A9 WON SIGN (H) and U+20AC EURO SIGN (F or N (A)).

Characters whose General Category is Mn, Me, Cc, Cf, Zl or Zp are treated as nonspacing.

REFERENCES

[CMOS]

The Chicago Manual of Style, 15th edition. University of Chicago Press, 2003.

[JIS X 4051]

JIS X 4051:2004 Formatting rules for Japanese documents (日本語文書の組版方法). Japanese Standards Association, 2004.

[JLREQ]

Y. Anan et al. Requirements for Japanese Text Layout, W3C Working Group Note, 3 April 2012. <http://www.w3.org/TR/2012/NOTE-jlreq-20120403/ja/>.

[UAX #11]

A. Freytag (ed.) (2008-2009). Unicode Standard Annex #11: East Asian Width, Revisions 17-19. <http://unicode.org/reports/tr11/>.

[UAX #14]

A. Freytag and A. Heninger (eds.) (2008-2015). Unicode Standard Annex #14: Unicode Line Breaking Algorithm, Revisions 22-35. <http://unicode.org/reports/tr14/>.

[UAX #29]

Mark Davis (ed.) (2009-2013). Unicode Standard Annex #29: Unicode Text Segmentation, Revisions 15-23. <http://www.unicode.org/reports/tr29/>.

SEE ALSO

Text::LineFold~[ja], Text::Wrap, Unicode::GCString~[ja].

AUTHOR

Copyright (C) 2009-2018 Hatuka*nezumi - IKEDA Soji <hatuka(at)nezumi.nu>.

This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.

