diff options
author | Jian Zhen <zhenjl@gmail.com> | 2015-01-21 07:19:47 -0800 |
---|---|---|
committer | Jian Zhen <zhenjl@gmail.com> | 2015-01-21 07:19:47 -0800 |
commit | 7aba042e03d7008f853a9ca92cdecbd891ba1d61 (patch) | |
tree | 860a69d43f72e354f24a045c11630f97662f929d | |
parent | 5544009e5c2ef0506300f908bb65ca10e4409d25 (diff) |
more doc updates
-rw-r--r-- | cmd/suffixfsm/README.md | 8 | ||||
-rw-r--r-- | porter2.go | 25 |
2 files changed, 32 insertions, 1 deletions
diff --git a/cmd/suffixfsm/README.md b/cmd/suffixfsm/README.md new file mode 100644 index 0000000..dba28e7 --- /dev/null +++ b/cmd/suffixfsm/README.md @@ -0,0 +1,8 @@ +suffixfsm +========= + +suffixfsm is a finite state machine generator for the [porter2](https://github.com/surgebase/porter2). It takes a list of suffixes, creates a tree, and then generates a FSM based on the tree. + +You can run the tool by `go run suffixfsm.go <filename>`. + +The output is a function skeleton for each of the suffix lists. Then you can take the output and customize it.
\ No newline at end of file @@ -14,10 +14,33 @@ // Porter2 implements the english Porter2 stemmer. It is written completely using // finite state machines to do suffix comparison, rather than the string-based -// or tree-based approaches. As a result, it is 600% faster compare to string-based +// or tree-based approaches. As a result, it is 660% faster compare to string-based // implementations. // // http://snowball.tartarus.org/algorithms/english/stemmer.html +// +// This implementation has been successfully validated with the dataset from +// http://snowball.tartarus.org/algorithms/english/ +// +// Most of the implementations rely completely on suffix string comparison. +// Basically there's a list of suffixes, and the code will loop through the +// list to see if there's a match. Given most of the time you are looking for +// the longest match, so you order the list so the longest is the first one. +// So if you are luckly, the match will be early on the list. But regardless +// that's a huge performance hit. +// +// This implementation is based completely on finite state machines to perform +// suffix comparison. You compare each chacter of the string starting at the +// last character going backwards. The state machines at each step will +// determine what the longest suffix is. You can think of the state machine +// as an unrolled tree. +// +// However, writing large state machines can be very error-prone. So I wrote +// a [quick tool](https://github.com/surgebase/porter2/tree/master/cmd/suffixfsm) +// to generate most of the state machines. The tool basically takes a file of +// suffixes, creates a tree, then unrolls the tree by dumping each of the nodes. +// +// You can run the tool by `go run suffixfsm.go <filename>`. package porter2 import "unicode" |