YaCy-Bugtracker - YaCy
View Issue Details
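ID: 0000575
Project: YaCy
Category: [All Projects] General
View Status: public
Date Submitted: 2015-05-12 17:32
Last Update: 2015-08-03 00:54
Reporter: sbolokanov
Assigned To: BuBu
Priority: high
Severity: major
Reproducibility: always
Status: resolved
Resolution: fixed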
none 
 
 
0000575: long urls problem
Here is the error that shows up in "Rejected URLs":
TEMPORARY_NETWORK_FAILURE cannot load: load error - java.io.IOException: Malformed escape pair at index 255: http://www.vmro.bg/%D0%BF%D1%84-%D0%BD%D1%81-%D0%B4%D0%B0-%D0%BE%D1%81%D1%8A%D0%B4%D0%B8-%D0%B8%D0%B7%D1%82%D1%80%D0%B5%D0%B1%D0%BB%D0%B5%D0%BD%D0%B8%D0%B5%D1%82%D0%BE-%D0%BD%D0%B0-%D0%B1%D1%8A%D0%BB%D0%B3%D0%B0%D1%80%D0%B8%D1%82%D0%B5-%D0%B2-%D0%BE%D1%81%

The full URL looks like this: http://www.vmro.bg/%D0%BF%D1%84-%D0%BD%D1%81-%D0%B4%D0%B0-%D0%BE%D1%81%D1%8A%D0%B4%D0%B8-%D0%B8%D0%B7%D1%82%D1%80%D0%B5%D0%B1%D0%BB%D0%B5%D0%BD%D0%B8%D0%B5%D1%82%D0%BE-%D0%BD%D0%B0-%D0%B1%D1%8A%D0%BB%D0%B3%D0%B0%D1%80%D0%B8%D1%82%D0%B5-%D0%B2-%D0%BE%D1%81%D0%BC%D0%B0%D0%BD%D1%81%D0%BA%D0%B0%D1%82%D0%B0-%D0%B8%D0%BC%D0%BF%D0%B5%D1%80%D0%B8%D1%8F/

I suspect that somewhere in the process the URL gets cut off at a fixed length.

Using the git version at the latest commit as of now:
https://github.com/yacy/yacy_search_server/commit/f5f88272e45ae5173791960631d62f07a3da0963

OS: Slackware64-current
Tags: bug
Attached Files: yacy00.log (557,241), 2015-05-29 10:21
http://mantis.tokeek.de/file_download.php?file_id=204&type=bug
Issue History
2015-05-12 17:32  sbolokanov  New Issue
2015-05-29 10:21  sbolokanov  File Added: yacy00.log
2015-05-29 10:46  sbolokanov  Note Added: 0001064
2015-05-29 10:52  sbolokanov  Note Added: 0001065
2015-05-29 10:53  sbolokanov  Note Edited: 0001064
2015-05-29 10:53  sbolokanov  Note Edited: 0001065
2015-05-29 10:54  sbolokanov  Note Edited: 0001064
2015-05-29 10:55  sbolokanov  Note Edited: 0001064
2015-05-29 10:56  sbolokanov  Tag Attached: bug
2015-05-29 17:51  sbolokanov  Note Edited: 0001064
2015-06-08 23:48  BuBu  Note Added: 0001068
2015-06-08 23:55  BuBu  Status: new => confirmed
2015-07-09 17:44  sbolokanov  Note Added: 0001086
2015-08-03 00:54  BuBu  Note Added: 0001088
2015-08-03 00:54  BuBu  Status: confirmed => resolved
2015-08-03 00:54  BuBu  Resolution: open => fixed
2015-08-03 00:54  BuBu  Assigned To: => BuBu

Notes
(0001064)
sbolokanov   
2015-05-29 10:46   
(edited on: 2015-05-29 17:51)
Updated to latest git (47682bf4676c49741bfd3a6c8bee3d3a7e5399a9), same problem. Attached log to the original post.

Whenever YaCy tries to process long Cyrillic URLs, they get cut off, which leaves them broken.
So at the moment, sites that use Cyrillic URLs can't be crawled properly.

I think the problem comes from the percent-encoding of Cyrillic URLs, which makes them too long for some part of YaCy.
Example:
The original URL is: http://www.nsi.bg/bg/content/13018/%D0%BE%D1%82%D0%B4%D0%B5%D0%BB-%D1%81%D1%82%D0%B0%D1%82%D0%B8%D1%81%D1%82%D0%B8%D1%87%D0%B5%D1%81%D0%BA%D0%B8-%D0%B8%D0%B7%D1%81%D0%BB%D0%B5%D0%B4%D0%B2%D0%B0%D0%BD%D0%B8%D1%8F-%D1%81%D0%BE%D1%84%D0%B8%D0%B9%D1%81%D0%BA%D0%B0-%D0%BE%D0%B1%D0%BB%D0%B0%D1%81%D1%82-%D0%BD%D0%B0-%D1%82%D1%81%D0%B1-%D1%8E%D0%B3%D0%BE%D0%B7%D0%B0%D0%BF%D0%B0%D0%B4-%D0%B5-%D1%81-%D0%BD%D0%BE%D0%B2-%D0%B0%D0%B4%D1%80%D0%B5%D1%81

(actual URL: http://www.nsi.bg/bg/content/13018/отдел-статистически-изследвания-софийска-област-на-тсб-югозапад-е-с-нов-адрес)

But in YaCy it looks like this: http://www.nsi.bg/bg/content/13018/%D0%BE%D1%82%D0%B4%D0%B5%D0%BB-%D1%81%D1%82%D0%B0%D1%82%D0%B8%D1%81%D1%82%D0%B8%D1%87%D0%B5%D1%81%D0%BA%D0%B8-%D0%B8%D0%B7%D1%81%D0%BB%D0%B5%D0%B4%D0%B2%D0%B0%D0%BD%D0%B8%D1%8F-%D1%81%D0%BE%D1%84%D0%B8%D0%B9%D1%81%D0%BA%D

(actual URL: http://www.nsi.bg/bg/content/13018/отдел-статистически-изследвания-софийск%D)

Snippet of Rejected URLs list below:

[quote]
2015/05/29 11:30:08 http://www.nsi.bg/bg/content/8350/%D0%BF%D1%83%D0%B1%D0%BB%D0%B8%D0%BA%D0%B0%D1%86%D0%B8%D1%8F/%D1%81%D1%82%D0%B0%D1%82%D0%B8%D1%81%D1%82%D0%B8%D1%87%D0%B5%D1%81%D0%BA%D0%B8-%D1%81%D0%B1%D0%BE%D1%80%D0%BD%D0%B8%D0%BA-%D0%BD%D0%B0-%D0%BE%D0%B1%D0%BB%D0%B0%D TEMPORARY_NETWORK_FAILURE cannot load: load error - java.io.IOException: Malformed escape pair at index 254: http://www.nsi.bg/bg/content/8350/%D0%BF%D1%83%D0%B1%D0%BB%D0%B8%D0%BA%D0%B0%D1%86%D0%B8%D1%8F/%D1%81%D1%82%D0%B0%D1%82%D0%B8%D1%81%D1%82%D0%B8%D1%87%D0%B5%D1%81%D0%BA%D0%B8-%D1%81%D0%B1%D0%BE%D1%80%D0%BD%D0%B8%D0%BA-%D0%BD%D0%B0-%D0%BE%D0%B1%D0%BB%D0%B0%D
2015/05/29 11:30:08 http://www.nsi.bg/bg/content/924/календарно-изгладени-2010100?qt-statistical_domain=1 TEMPORARY_NETWORK_FAILURE cannot load: load error - java.io.IOException: Client can't execute: No route to host duration=4 for url http://www.nsi.bg/bg/content/924/календарно-изгладени-2010100?qt-statistical_domain=1
2015/05/29 11:30:08 http://www.nsi.bg/bg/content/3128/%D0%BF%D1%80%D0%B5%D1%81%D1%81%D1%8A%D0%BE%D0%B1%D1%89%D0%B5%D0%BD%D0%B8%D0%B5/%D0%B8%D0%BD%D0%B4%D0%B5%D0%BA%D1%81%D0%B8-%D0%BD%D0%B0-%D0%BF%D0%B0%D0%B7%D0%B0%D1%80%D0%BD%D0%B8%D1%82%D0%B5-%D1%86%D0%B5%D0%BD%D0%B8-%D0%BD% TEMPORARY_NETWORK_FAILURE cannot load: load error - java.io.IOException: Malformed escape pair at index 255: http://www.nsi.bg/bg/content/3128/%D0%BF%D1%80%D0%B5%D1%81%D1%81%D1%8A%D0%BE%D0%B1%D1%89%D0%B5%D0%BD%D0%B8%D0%B5/%D0%B8%D0%BD%D0%B4%D0%B5%D0%BA%D1%81%D0%B8-%D0%BD%D0%B0-%D0%BF%D0%B0%D0%B7%D0%B0%D1%80%D0%BD%D0%B8%D1%82%D0%B5-%D1%86%D0%B5%D0%BD%D0%B8-%D0%BD%
[/quote]
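
For readers who want to see why Cyrillic URLs hit a length limit so quickly, here is a minimal Python sketch (not YaCy code; the function name `encoded_growth` is made up for illustration). It relies only on the standard `urllib.parse.quote`, which UTF-8 encodes each non-ASCII character and writes every byte as `%XX`. A Cyrillic letter takes two bytes in UTF-8, so one letter becomes six characters in the encoded URL.

```python
from urllib.parse import quote


def encoded_growth(path: str) -> tuple[int, int]:
    """Return (raw length, percent-encoded length) of a URL path."""
    # Non-ASCII characters are UTF-8 encoded, then each byte is written as %XX.
    encoded = quote(path, safe="/")
    return len(path), len(encoded)


# The first word of the example path above: 5 letters become 30 characters.
print(quote("отдел"))
print(encoded_growth("отдел"))
```

At six characters per letter, a path of only a few dozen Cyrillic letters is already longer than 256 characters once encoded.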

(0001065)
sbolokanov   
2015-05-29 10:52   
(edited on: 2015-05-29 10:53)
By the way, I just did a quick length check of a few of the broken (malformed) URLs: they are all 256 characters long, and I can find no pattern in the breaking point other than that number.
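
The length check described above can be reproduced with a small Python sketch. This is an illustration only, not how YaCy or Java detects the problem; the helper name `first_malformed_escape` is hypothetical. It looks for a `%` that is not followed by two hex digits, which is the condition the "Malformed escape pair" message complains about.

```python
import re

# A "%" that is not followed by two hex digits is a malformed escape.
_BAD_ESCAPE = re.compile(r"%(?![0-9A-Fa-f]{2})")


def first_malformed_escape(url: str) -> int:
    """Return the index of the first malformed percent-escape, or -1 if none."""
    match = _BAD_ESCAPE.search(url)
    return match.start() if match else -1


# A 256-character string that ends in a dangling "%D", like the rejected URLs.
truncated = "x" * 254 + "%D"
print(len(truncated), first_malformed_escape(truncated))  # 256 254
```

For a 256-character URL the reported index is 255 when the cut lands right after a `%` and 254 when it lands after `%` plus one hex digit, which matches the indexes in the rejected URL list. A cut that happens to fall on an escape boundary would pass this check even though the URL is still incomplete.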

(0001068)
BuBu   
2015-06-08 23:48   
Indeed, the crawler uses a file-based queue which limits the URL length to 256 chars.

The line responsible for the limit is
https://github.com/yacy/yacy_search_server/blob/master/source/net/yacy/crawler/retrieval/Request.java#L51

But changing it has a fatal effect on any existing crawl queue during startup (I'm not comfortable touching it).
(0001086)
sbolokanov   
2015-07-09 17:44   
Just want to report:

I've set up a new YaCy instance and changed the suggested value to 2048. Now it seems to work fine.
I guess it will stay this way for now. At least it's working.
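
The effect of raising the value can be sketched in Python, assuming the limit behaves like a simple cut at a fixed number of characters. This is not YaCy's code; `store_truncating`, `store_checked` and the example URL are hypothetical. The second function shows one possible alternative to silent truncation: refusing an over-long URL instead of storing a broken prefix.

```python
def store_truncating(url: str, width: int) -> str:
    """Mimic a fixed-width field that silently cuts anything longer than `width`."""
    return url[:width]


def store_checked(url: str, width: int) -> str:
    """Alternative: reject over-long URLs rather than keep a broken prefix."""
    if len(url) > width:
        raise ValueError(f"URL is {len(url)} chars, limit is {width}")
    return url


# Hypothetical URL with 60 percent-encoded Cyrillic letters in the path.
long_url = "http://www.example.org/" + "%D0%BE" * 60

print(store_truncating(long_url, 256) == long_url)   # False: cut at the old limit
print(store_truncating(long_url, 2048) == long_url)  # True: fits the raised limit
```

With the rejecting variant, an over-long URL would be dropped with a clear length error instead of reaching the loader as a malformed URL.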
(0001088)
BuBu   
2015-08-03 00:54   
Fixed in v1.83/9302 with commit
https://github.com/yacy/yacy_search_server/commit/fa08ca207e5aef4a62a345765206e11f44fd2dfc

All running crawls need to be finished (they can be restarted afterwards).