A simple word segmenter written in Scala
I named it segcala; it lives at http://segcala.googlecode.com.
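I have been curious about Scala these past few days and read some material on it, but with a language, reading without practice never really sinks in, so I wrote this word segmenter. It uses Chih-Hao Tsai's mmseg segmentation algorithm. Being a fan of dependency injection, I used Google Guice as the dependency injection container.
Scala on the JVM can run about as fast as Java. And as a language that supports both object-oriented and functional programming, its expressive power is something Java cannot come close to.
Take the mmseg rule that keeps the chunks with the largest degree of morphemic freedom as an example, and see how Scala makes the program shorter and clearer:
Scala version: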
def largestSumMorphemicFreedomDegreeRule(chunks: List[Chunk]): List[Chunk] = {
  // Find a chunk with the largest degree of morphemic freedom ...
  val c = chunks.reduceLeft((c1, c2) => if (c1.degreeOfMorphemicFreedom > c2.degreeOfMorphemicFreedom) c1 else c2)
  // ... then keep every chunk that ties with it.
  chunks.filter(chunk => chunk.degreeOfMorphemicFreedom == c.degreeOfMorphemicFreedom)
}
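To try the rule in isolation, here is a minimal self-contained sketch. The `Chunk` case class, the `RuleDemo` object and the sample values are all made up for illustration; they are not segcala's actual classes, and the only thing the rule needs from a chunk is its `degreeOfMorphemicFreedom`.

```scala
object RuleDemo {
  // Hypothetical stand-in for segcala's Chunk: just the field the rule reads.
  case class Chunk(text: String, degreeOfMorphemicFreedom: Double)

  // Same rule as above: find the maximum degree, keep every chunk that ties for it.
  def largestSumMorphemicFreedomDegreeRule(chunks: List[Chunk]): List[Chunk] = {
    val c = chunks.reduceLeft((c1, c2) =>
      if (c1.degreeOfMorphemicFreedom > c2.degreeOfMorphemicFreedom) c1 else c2)
    chunks.filter(chunk => chunk.degreeOfMorphemicFreedom == c.degreeOfMorphemicFreedom)
  }

  // Arbitrary sample data; the degrees are not real frequencies.
  val candidates: List[Chunk] =
    List(Chunk("a", 3.5), Chunk("b", 7.25), Chunk("c", 7.25), Chunk("d", 1.0))

  def main(args: Array[String]): Unit = {
    // prints List(Chunk(b,7.25), Chunk(c,7.25))
    println(largestSumMorphemicFreedomDegreeRule(candidates))
  }
}
```

Both chunks that tie for the maximum survive, in their original order. Note that `reduceLeft` throws on an empty list, so this sketch assumes the caller always passes at least one chunk.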
Now look at the Java version (taken from solo L's mmseg library):
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class LSDMFOCWRule implements IRule {
    /* (non-Javadoc)
     * @see org.solol.mmseg.core.IRule#invoke()
     */
    public final IChunk[] invoke(final IChunk[] chunks) {
        // Wrap every chunk so the array can be sorted by degree, largest first.
        LSDMFOCWRuleComparator[] orderedChunks = new LSDMFOCWRuleComparator[chunks.length];
        for (int i = 0; i < chunks.length; i++) {
            orderedChunks[i] = new LSDMFOCWRuleComparator(chunks[i]);
        }
        Arrays.sort(orderedChunks);
        int index = 0;
        double degreeOfMorphemicFreedom = orderedChunks[index].getChunk().getDegreeOfMorphemicFreedom();
        List list = new ArrayList(1);
        list.add(orderedChunks[index].getChunk());
        index++;
        // Collect the chunks that tie with the first (largest) one.
        while (index < orderedChunks.length) {
            if (orderedChunks[index].getChunk().getDegreeOfMorphemicFreedom() == degreeOfMorphemicFreedom) {
                list.add(orderedChunks[index].getChunk());
            } else {
                break;
            }
            index++;
        }
        IChunk[] degreeOfMorphemicFreedomChunks = new IChunk[list.size()];
        list.toArray(degreeOfMorphemicFreedomChunks);
        return degreeOfMorphemicFreedomChunks;
    }

    static class LSDMFOCWRuleComparator implements Comparable {
        private IChunk chunk;

        public LSDMFOCWRuleComparator(IChunk chunk) {
            this.chunk = chunk;
        }

        public IChunk getChunk() {
            return chunk;
        }

        // Orders by descending degree of morphemic freedom.
        public int compareTo(Object obj) {
            IChunk another = ((LSDMFOCWRuleComparator) obj).getChunk();
            double temp = another.getDegreeOfMorphemicFreedom()
                    - chunk.getDegreeOfMorphemicFreedom();
            if (temp > 0D) {
                return 1;
            } else if (temp < 0D) {
                return -1;
            } else {
                return 0;
            }
        }
    }
}

