# Using HanLP for Word Segmentation in Spark
Upload HanLP's `data` directory (which contains the dictionaries and models) to HDFS, then configure the `root` path in the project's `hanlp.properties` file; `root` is the base directory from which HanLP resolves its dictionary and model paths. For example:

```properties
root=hdfs://localhost:9000/tmp/
```
Implement the `com.hankcs.hanlp.corpus.io.IIOAdapter` interface so that HanLP reads and writes files through HDFS:
```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import com.hankcs.hanlp.corpus.io.IIOAdapter;

public static class HadoopFileIoAdapter implements IIOAdapter {
    @Override
    public InputStream open(String path) throws IOException {
        // Read dictionary/model files from HDFS instead of the local filesystem
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(path), conf);
        return fs.open(new Path(path));
    }

    @Override
    public OutputStream create(String path) throws IOException {
        // Write files (e.g., binary dictionary caches) back to HDFS
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(path), conf);
        return fs.create(new Path(path));
    }
}
```
Set the `IOAdapter` and create the segmenter:
```java
import com.hankcs.hanlp.HanLP;
import com.hankcs.hanlp.seg.Segment;
import com.hankcs.hanlp.seg.CRF.CRFSegment;

private static Segment segment;

static {
    // Install the HDFS adapter before creating the segmenter, so HanLP
    // loads its dictionaries and models through HDFS rather than local disk
    HanLP.Config.IOAdapter = new HadoopFileIoAdapter();
    segment = new CRFSegment();
}
```
Then you can use `segment` inside Spark operations to tokenize text, as sketched below.
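For example, here is a minimal sketch of a Spark driver that segments each line of a text file. The input and output paths are hypothetical, and it assumes the `HadoopFileIoAdapter` class and the static `segment` field shown above are declared in the same class:

```java
import java.util.stream.Collectors;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class SparkSegmentDemo {
    // HadoopFileIoAdapter and the static `segment` field from the
    // snippets above are assumed to be declared here as well.

    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("hanlp-segment");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Hypothetical input path on the same HDFS instance as the HanLP data
        JavaRDD<String> lines = sc.textFile("hdfs://localhost:9000/tmp/input.txt");

        // segment.seg(...) returns a List<Term>; join the words with spaces
        JavaRDD<String> segmented = lines.map(line ->
                segment.seg(line).stream()
                        .map(term -> term.word)
                        .collect(Collectors.joining(" ")));

        // Hypothetical output path
        segmented.saveAsTextFile("hdfs://localhost:9000/tmp/output");
        sc.stop();
    }
}
```

Because `segment` is a static field, the lambda passed to `map` does not capture it: each executor JVM runs the static initializer once and builds its own segmenter, which sidesteps shipping the segmenter object itself from the driver to the executors.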
Original article: https://blog.csdn.net/l294265421/article/details/72932042