YaCy-Bugtracker - YaCy
View Issue Details
0000630 | YaCy | [All Projects] General | public | 2016-01-12 22:22 | 2016-01-19 09:03
GNU/Linux | Debian Jessie
YaCy 1.8
0000630: Access to the MediaWiki and phpBB3 forum crawling pages fails in Robinson mode
When a YaCy node is configured with the 'Search portal' or 'Intranet indexing' use case, access to /Load_MediawikiWiki.html and /Load_PHPBB3.html fails with an HTTP 500 error.

Error details as displayed in the browser:

Problem accessing /Load_PHPBB3.html. Reason:

    Server Error

Caused by:

javax.servlet.ServletException: /home/luc/git/yacy_search_server/htroot/Load_PHPBB3.html
    at net.yacy.http.servlets.YaCyDefaultServlet.handleTemplate(YaCyDefaultServlet.java:844)
    at net.yacy.http.servlets.YaCyDefaultServlet.doGet(YaCyDefaultServlet.java:319)
No tags attached.
Issue History
2016-01-12 22:22 | luc | New Issue
2016-01-12 22:23 | luc | Note Added: 0001202
2016-01-17 01:01 | BuBu | Note Added: 0001203
2016-01-19 09:03 | luc | Note Added: 0001205

2016-01-12 22:23 | luc | Note 0001202
This was a NullPointerException case.
I propose a fix: https://github.com/luccioman/yacy_search_server/commit/231be83eb65e7289ad56a3544fc9029dda656009
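
For illustration only (this is not the code from the commit linked above): note 0001205 later in this thread reports that the seed's IP can be empty or null in Intranet and Portal modes. A minimal sketch of guarding against that when building a default start URL might look like the following. The class and method names are hypothetical and are not YaCy APIs; the port and the /repository/ path are taken from the default URL quoted in this issue.

```java
import java.util.Optional;

public class SeedHostFallback {

    /**
     * Hypothetical helper: returns the peer IP when it is known,
     * otherwise the given fallback host. Null and blank values are
     * both treated as "unknown".
     */
    public static String hostOrFallback(String seedIp, String fallback) {
        return Optional.ofNullable(seedIp)
                .map(String::trim)
                .filter(ip -> !ip.isEmpty())
                .orElse(fallback);
    }

    /**
     * Builds a default start URL without dereferencing a possibly
     * null IP. Falling back to "localhost" is an assumption made
     * for this sketch.
     */
    public static String defaultStartUrl(String seedIp, int port) {
        return "http://" + hostOrFallback(seedIp, "localhost") + ":" + port + "/repository/";
    }

    public static void main(String[] args) {
        // IP unknown, as reported for Intranet/Portal modes in this issue
        System.out.println(defaultStartUrl(null, 8090));
        // IP known (203.0.113.7 is a documentation-only address)
        System.out.println(defaultStartUrl("203.0.113.7", 8090));
    }
}
```

The point of the sketch is only that the absent-IP case is handled explicitly before the value is used, so the page can still render a sensible default.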
2016-01-17 01:01 | BuBu | Note 0001203
On a quick attempt to reproduce the behavior, I saw the null pointer exception only in "Intranet" mode.
True, the exception shouldn't happen; on the other hand, by definition of Intranet mode, external MediaWiki URLs shouldn't be accepted.
2016-01-19 09:03 | luc | Note 0001205
- On my peer, in Intranet or Portal mode, sb.peers.mySeed().getIPs() and sb.peers.mySeed().getIP() always return empty or null. I am behind a router and have no static IP.
So when SeedDB.initMySeed is processed (https://github.com/yacy/yacy_search_server/blob/7d0d19cb8eb0817db290fc60555b9262ccb253a7/source/net/yacy/peers/SeedDB.java#L213), serverSwitch.myPublicIPs returns an empty result because the only addresses found are local network or loopback addresses.
In P2P mode, my public IP is found when processing Protocol.hello(...).
I guess everything here is normal, as my peer is reported as senior.

- Shouldn't crawling of MediaWiki URLs be accepted even in intranet mode (the default proposed URL is http://localhost:8090/repository/)? We want to be able to crawl a wiki or phpBB instance on the local host or the local network. A test shows that external URLs are correctly rejected in intranet mode. For example: "Crawling of "https://fr.wikipedia.org/" failed. Reason: denied_(the host 'fr.wikipedia.org' is global, but global addresses are not accepted"
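
To make the local/global distinction above concrete, here is a rough sketch of how a host could be classified using only the standard java.net.InetAddress API. This is not YaCy's own check; the class and method names are hypothetical, and treating loopback, site-local and link-local addresses as "local" is an assumption for the example.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

public class LocalAddressCheck {

    /**
     * Hypothetical helper: returns true when the host resolves to a
     * loopback, private (site-local), link-local or wildcard address.
     * Unresolvable hosts are reported as not local.
     */
    public static boolean isLocal(String host) {
        try {
            InetAddress address = InetAddress.getByName(host);
            return address.isLoopbackAddress()
                    || address.isSiteLocalAddress()
                    || address.isLinkLocalAddress()
                    || address.isAnyLocalAddress();
        } catch (UnknownHostException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Literal IP addresses are parsed directly, so no DNS lookup is needed here.
        String[] hosts = { "127.0.0.1", "192.168.1.10", "8.8.8.8" };
        for (String host : hosts) {
            System.out.println(host + " local: " + isLocal(host));
        }
    }
}
```

Under such a rule, a wiki or phpBB instance on the local host or local network would be accepted as a crawl start in intranet mode, while a global host would still be rejected.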