Nutch on Hadoop: Input path does not exist
I am getting an "Input path does not exist" error when I run the command:
nutch inject crawldb urls
In nutch/logs, hadoop.log contains the following error:
2015-08-16 16:08:12,834 INFO  crawl.Injector - Injector: starting at 2015-08-16 16:08:12
2015-08-16 16:08:12,834 INFO  crawl.Injector - Injector: crawlDb: crawldb
2015-08-16 16:08:12,835 INFO  crawl.Injector - Injector: urlDir: urls
2015-08-16 16:08:12,835 INFO  crawl.Injector - Injector: Converting injected urls to crawl db entries.
2015-08-16 16:08:13,296 WARN  util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2015-08-16 16:08:13,417 WARN  snappy.LoadSnappy - Snappy native library not loaded
2015-08-16 16:08:13,430 ERROR security.UserGroupInformation - PriviledgedActionException as:hdravi cause:org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/home/hdravi/urls
2015-08-16 16:08:13,432 ERROR crawl.Injector - Injector: org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/home/hdravi/urls
    at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:197)
    at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:208)
    at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:1081)
    at org.apache.hadoop.mapred.JobClient.writeSplits(JobClient.java:1073)
    at org.apache.hadoop.mapred.JobClient.access$700(JobClient.java:179)
    at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:983)
    at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:936)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1190)
    at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:936)
    at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:910)
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1353)
    at org.apache.nutch.crawl.Injector.inject(Injector.java:323)
    at org.apache.nutch.crawl.Injector.run(Injector.java:379)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.nutch.crawl.Injector.main(Injector.java:369)
It looks like Nutch is searching the local file system instead of HDFS.
This is the content of Hadoop's core-site.xml:
<configuration>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/app/hadoop/tmp</value>
    <description>A base for other temporary directories.</description>
  </property>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:54310</value>
    <description>The name of the default file system. A URI whose scheme and authority determine the FileSystem implementation. The URI's scheme determines the config property (fs.SCHEME.impl) naming the FileSystem implementation class. The URI's authority is used to determine the host, port, etc. for a filesystem.</description>
  </property>
</configuration>
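The file:/home/hdravi/urls in the stack trace suggests that this core-site.xml is not on the classpath at runtime, so the client falls back to the local file system. As a quick sanity check, here is a minimal sketch (the FsCheck class name is made up for illustration) that prints which default filesystem the Hadoop client actually resolves:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class FsCheck {
    public static void main(String[] args) throws Exception {
        // Reads core-site.xml / hdfs-site.xml from the classpath, if present
        Configuration conf = new Configuration();
        // Prints hdfs://localhost:54310 when the config above is picked up,
        // file:/// when it falls back to the local file system
        System.out.println(FileSystem.get(conf).getUri());
    }
}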
This is the content of Hadoop's hdfs-site.xml:
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
    <description>Default block replication. The actual number of replications can be specified when the file is created. The default is used if replication is not specified in create time.</description>
  </property>
</configuration>
When I type hadoop fs -ls -R /, the output is:
drwxrwxrwx   - hdravi supergroup          0 2015-08-16 16:06 /user
drwxrwxrwx   - hdravi supergroup          0 2015-08-16 16:06 /user/hdravi
drwxr-xr-x   - hdravi supergroup          0 2015-08-16 16:06 /user/hdravi/urls
-rw-r--r--   1 hdravi supergroup        240 2015-08-16 16:06 /user/hdravi/urls/seed.txt
Am I missing any configuration in Hadoop/Nutch?
UPDATE:
I get the following error when I use the complete HDFS path:
2015-08-16 23:33:22,876 INFO  crawl.Injector - Injector: starting at 2015-08-16 23:33:22
2015-08-16 23:33:22,877 INFO  crawl.Injector - Injector: crawlDb: crawldb
2015-08-16 23:33:22,877 INFO  crawl.Injector - Injector: urlDir: hdfs://localhost:54310/user/hdravi/user/hdravi/urls
2015-08-16 23:33:22,878 INFO  crawl.Injector - Injector: Converting injected urls to crawl db entries.
2015-08-16 23:33:23,317 WARN  util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2015-08-16 23:33:23,410 WARN  snappy.LoadSnappy - Snappy native library not loaded
2015-08-16 23:33:23,762 ERROR security.UserGroupInformation - PriviledgedActionException as:hdravi cause:org.apache.hadoop.ipc.RemoteException: Server IPC version 9 cannot communicate with client version 4
2015-08-16 23:33:23,764 ERROR crawl.Injector - Injector: org.apache.hadoop.ipc.RemoteException: Server IPC version 9 cannot communicate with client version 4
    at org.apache.hadoop.ipc.Client.call(Client.java:1107)
    at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:229)
    at com.sun.proxy.$Proxy1.getProtocolVersion(Unknown Source)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:85)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:62)
    at com.sun.proxy.$Proxy1.getProtocolVersion(Unknown Source)
    at org.apache.hadoop.ipc.RPC.checkVersion(RPC.java:422)
    at org.apache.hadoop.hdfs.DFSClient.createNamenode(DFSClient.java:183)
    at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:281)
    at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:245)
    at org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:100)
    at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1437)
    at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:66)
    at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1455)
    at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:254)
    at org.apache.hadoop.fs.Path.getFileSystem(Path.java:187)
    at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:176)
    at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:208)
    at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:1081)
    at org.apache.hadoop.mapred.JobClient.writeSplits(JobClient.java:1073)
    at org.apache.hadoop.mapred.JobClient.access$700(JobClient.java:179)
    at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:983)
    at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:936)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1190)
    at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:936)
    at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:910)
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1353)
    at org.apache.nutch.crawl.Injector.inject(Injector.java:323)
    at org.apache.nutch.crawl.Injector.run(Injector.java:379)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.nutch.crawl.Injector.main(Injector.java:369)
I am not sure about Nutch, but as far as Hadoop is concerned, try loading the configuration files explicitly using a Configuration object before starting the MapReduce job.
This solution works for me:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Load the cluster configuration explicitly so the client
// talks to HDFS instead of the local file system
Configuration conf = new Configuration();
conf.addResource(new Path("path to hadoop/conf/core-site.xml"));
conf.addResource(new Path("path to hadoop/conf/hdfs-site.xml"));
FileSystem fs = FileSystem.get(conf);
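To confirm the client now resolves the right filesystem, you can add a small check after the snippet above (the path below just mirrors the listing from the question and is only an example):

// Sanity check: should print hdfs://localhost:54310 and "true"
Path urlDir = new Path("/user/hdravi/urls");
System.out.println(fs.getUri());
System.out.println(fs.exists(urlDir));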
You may also try giving the full HDFS path of the input directory:
hdfs://localhost:54310/user/hdravi
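For example, assuming the seed directory from the listing in the question, the inject command would become something like:

nutch inject crawldb hdfs://localhost:54310/user/hdravi/urls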