summaryrefslogtreecommitdiff
diff options
context:
space:
mode:
authorJian Zhen <zhenjl@gmail.com>2015-01-21 07:19:47 -0800
committerJian Zhen <zhenjl@gmail.com>2015-01-21 07:19:47 -0800
commit7aba042e03d7008f853a9ca92cdecbd891ba1d61 (patch)
tree860a69d43f72e354f24a045c11630f97662f929d
parent5544009e5c2ef0506300f908bb65ca10e4409d25 (diff)
more doc updates
-rw-r--r--cmd/suffixfsm/README.md8
-rw-r--r--porter2.go25
2 files changed, 32 insertions, 1 deletions
diff --git a/cmd/suffixfsm/README.md b/cmd/suffixfsm/README.md
new file mode 100644
index 0000000..dba28e7
--- /dev/null
+++ b/cmd/suffixfsm/README.md
@@ -0,0 +1,8 @@
+suffixfsm
+=========
+
+suffixfsm is a finite state machine generator for the [porter2](https://github.com/surgebase/porter2). It takes a list of suffixes, creates a tree, and then generates a FSM based on the tree.
+
+You can run the tool by `go run suffixfsm.go <filename>`.
+
+The output is a function skeleton for each of the suffix lists. Then you can take the output and customize it. \ No newline at end of file
diff --git a/porter2.go b/porter2.go
index 4d1f3b7..ee8ad56 100644
--- a/porter2.go
+++ b/porter2.go
@@ -14,10 +14,33 @@
// Porter2 implements the english Porter2 stemmer. It is written completely using
// finite state machines to do suffix comparison, rather than the string-based
-// or tree-based approaches. As a result, it is 600% faster compare to string-based
+// or tree-based approaches. As a result, it is 660% faster compare to string-based
// implementations.
//
// http://snowball.tartarus.org/algorithms/english/stemmer.html
+//
+// This implementation has been successfully validated with the dataset from
+// http://snowball.tartarus.org/algorithms/english/
+//
+// Most of the implementations rely completely on suffix string comparison.
+// Basically there's a list of suffixes, and the code will loop through the
+// list to see if there's a match. Given most of the time you are looking for
+// the longest match, so you order the list so the longest is the first one.
+// So if you are luckly, the match will be early on the list. But regardless
+// that's a huge performance hit.
+//
+// This implementation is based completely on finite state machines to perform
+// suffix comparison. You compare each chacter of the string starting at the
+// last character going backwards. The state machines at each step will
+// determine what the longest suffix is. You can think of the state machine
+// as an unrolled tree.
+//
+// However, writing large state machines can be very error-prone. So I wrote
+// a [quick tool](https://github.com/surgebase/porter2/tree/master/cmd/suffixfsm)
+// to generate most of the state machines. The tool basically takes a file of
+// suffixes, creates a tree, then unrolls the tree by dumping each of the nodes.
+//
+// You can run the tool by `go run suffixfsm.go <filename>`.
package porter2
import "unicode"