more doc updates

author: Jian Zhen <zhenjl@gmail.com> 2015-01-21 07:19:47 -0800
committer: Jian Zhen <zhenjl@gmail.com> 2015-01-21 07:19:47 -0800
commit: 7aba042e03d7008f853a9ca92cdecbd891ba1d61 (patch)
tree: 860a69d43f72e354f24a045c11630f97662f929d
parent: 5544009e5c2ef0506300f908bb65ca10e4409d25 (diff)
2 files changed, 32 insertions, 1 deletions
diff --git a/cmd/suffixfsm/README.md b/cmd/suffixfsm/README.md
new file mode 100644
index 0000000..dba28e7
--- /dev/null
+++ b/cmd/suffixfsm/README.md
@@ -0,0 +1,8 @@
+suffixfsm
+=========
+
+suffixfsm is a finite state machine generator for the [porter2](https://github.com/surgebase/porter2). It takes a list of suffixes, creates a tree, and then generates a FSM based on the tree. 
+
+You can run the tool by `go run suffixfsm.go <filename>`.
+
+The output is a function skeleton for each of the suffix lists. Then you can take the output and customize it.
+\ No newline at end of file
diff --git a/porter2.go b/porter2.go
index 4d1f3b7..ee8ad56 100644
--- a/porter2.go
+++ b/porter2.go
@@ -14,10 +14,33 @@
 
 // Porter2 implements the english Porter2 stemmer. It is written completely using
 // finite state machines to do suffix comparison, rather than the string-based
-// or tree-based approaches. As a result, it is 600% faster compare to string-based
+// or tree-based approaches. As a result, it is 660% faster compare to string-based
 // implementations.
 //
 // http://snowball.tartarus.org/algorithms/english/stemmer.html
+//
+// This implementation has been successfully validated with the dataset from
+// http://snowball.tartarus.org/algorithms/english/
+//
+// Most of the implementations rely completely on suffix string comparison.
+// Basically there's a list of suffixes, and the code will loop through the
+// list to see if there's a match. Given most of the time you are looking for
+// the longest match, so you order the list so the longest is the first one.
+// So if you are luckly, the match will be early on the list. But regardless
+// that's a huge performance hit.
+//
+// This implementation is based completely on finite state machines to perform
+// suffix comparison. You compare each chacter of the string starting at the
+// last character going backwards. The state machines at each step will
+// determine what the longest suffix is. You can think of the state machine
+// as an unrolled tree.
+//
+// However, writing large state machines can be very error-prone. So I wrote
+// a [quick tool](https://github.com/surgebase/porter2/tree/master/cmd/suffixfsm)
+// to generate most of the state machines. The tool basically takes a file of
+// suffixes, creates a tree, then unrolls the tree by dumping each of the nodes.
+//
+// You can run the tool by `go run suffixfsm.go <filename>`.
 package porter2
 
 import "unicode"
author	Jian Zhen <zhenjl@gmail.com>	2015-01-21 07:19:47 -0800
committer	Jian Zhen <zhenjl@gmail.com>	2015-01-21 07:19:47 -0800
commit	7aba042e03d7008f853a9ca92cdecbd891ba1d61 (patch)
tree	860a69d43f72e354f24a045c11630f97662f929d
parent	5544009e5c2ef0506300f908bb65ca10e4409d25 (diff)