YaCy-Bugtracker

ID: 0000575
Project: YaCy
Category: [All Projects] General
View Status: public
Date Submitted: 2015-05-12 17:32
Last Update: 2015-08-03 00:54
Reporter: sbolokanov
Assigned To: BuBu
Priority: high
Severity: major
Reproducibility: always
Status: resolved
Resolution: fixed
ETA: none
Summary: 0000575: long urls problem
Description: Here is the error that shows up in "Rejected URLs":
TEMPORARY_NETWORK_FAILURE cannot load: load error - java.io.IOException: Malformed escape pair at index 255: http://www.vmro.bg/%D0%BF%D1%84-%D0%BD%D1%81-%D0%B4%D0%B0-%D0%BE%D1%81%D1%8A%D0%B4%D0%B8-%D0%B8%D0%B7%D1%82%D1%80%D0%B5%D0%B1%D0%BB%D0%B5%D0%BD%D0%B8%D0%B5%D1%82%D0%BE-%D0%BD%D0%B0-%D0%B1%D1%8A%D0%BB%D0%B3%D0%B0%D1%80%D0%B8%D1%82%D0%B5-%D0%B2-%D0%BE%D1%81%

The full URL looks like this: http://www.vmro.bg/%D0%BF%D1%84-%D0%BD%D1%81-%D0%B4%D0%B0-%D0%BE%D1%81%D1%8A%D0%B4%D0%B8-%D0%B8%D0%B7%D1%82%D1%80%D0%B5%D0%B1%D0%BB%D0%B5%D0%BD%D0%B8%D0%B5%D1%82%D0%BE-%D0%BD%D0%B0-%D0%B1%D1%8A%D0%BB%D0%B3%D0%B0%D1%80%D0%B8%D1%82%D0%B5-%D0%B2-%D0%BE%D1%81%D0%BC%D0%B0%D0%BD%D1%81%D0%BA%D0%B0%D1%82%D0%B0-%D0%B8%D0%BC%D0%BF%D0%B5%D1%80%D0%B8%D1%8F/

I suspect that somewhere in the process the URL gets cut off at some fixed length.

Using the git version at the latest commit as of now:
https://github.com/yacy/yacy_search_server/commit/f5f88272e45ae5173791960631d62f07a3da0963

OS: Slackware64-current
Tags: bug
Attached Files: yacy00.log (557,241 bytes) 2015-05-29 10:21

-  Notes
(0001064)
sbolokanov (reporter)
2015-05-29 10:46
edited on: 2015-05-29 17:51

Updated to the latest git (47682bf4676c49741bfd3a6c8bee3d3a7e5399a9); same problem. I attached a log to the original post.

Whenever YaCy tries to process long Cyrillic URLs, they get cut off, which leaves them broken.
So at the moment, sites that use Cyrillic URLs can't be crawled properly.

I think the problem is that percent-encoding makes Cyrillic URLs too long for some part of YaCy.
Example:
The original URL is: http://www.nsi.bg/bg/content/13018/%D0%BE%D1%82%D0%B4%D0%B5%D0%BB-%D1%81%D1%82%D0%B0%D1%82%D0%B8%D1%81%D1%82%D0%B8%D1%87%D0%B5%D1%81%D0%BA%D0%B8-%D0%B8%D0%B7%D1%81%D0%BB%D0%B5%D0%B4%D0%B2%D0%B0%D0%BD%D0%B8%D1%8F-%D1%81%D0%BE%D1%84%D0%B8%D0%B9%D1%81%D0%BA%D0%B0-%D0%BE%D0%B1%D0%BB%D0%B0%D1%81%D1%82-%D0%BD%D0%B0-%D1%82%D1%81%D0%B1-%D1%8E%D0%B3%D0%BE%D0%B7%D0%B0%D0%BF%D0%B0%D0%B4-%D0%B5-%D1%81-%D0%BD%D0%BE%D0%B2-%D0%B0%D0%B4%D1%80%D0%B5%D1%81

(decoded: http://www.nsi.bg/bg/content/13018/отдел-статистически-изследвания-софийска-област-на-тсб-югозапад-е-с-нов-адрес)

but in YaCy it looks like this: http://www.nsi.bg/bg/content/13018/%D0%BE%D1%82%D0%B4%D0%B5%D0%BB-%D1%81%D1%82%D0%B0%D1%82%D0%B8%D1%81%D1%82%D0%B8%D1%87%D0%B5%D1%81%D0%BA%D0%B8-%D0%B8%D0%B7%D1%81%D0%BB%D0%B5%D0%B4%D0%B2%D0%B0%D0%BD%D0%B8%D1%8F-%D1%81%D0%BE%D1%84%D0%B8%D0%B9%D1%81%D0%BA%D

(decoded: http://www.nsi.bg/bg/content/13018/отдел-статистически-изследвания-софийск%D)
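
To make the size effect concrete, here is a minimal Java sketch that percent-encodes the first word of the path above and compares lengths. The class and method names are made up for illustration and are not YaCy code. Each Cyrillic letter takes two bytes in UTF-8, and each byte becomes a three-character `%XX` escape, so one letter occupies six characters in the encoded URL.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Illustrative only: shows how percent-encoding inflates Cyrillic text.
public class PercentEncodingGrowth {

    // Encodes one path segment as UTF-8 percent-escapes.
    // URLEncoder targets form encoding (a space becomes '+'), which is
    // close enough for this letters-only demonstration.
    static String encodeSegment(String segment) {
        try {
            return URLEncoder.encode(segment, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always available", e);
        }
    }

    public static void main(String[] args) {
        // First word of the path above ("otdel" in Cyrillic), 5 letters,
        // written with Unicode escapes to keep the source ASCII-only.
        String word = "\u043e\u0442\u0434\u0435\u043b";
        String encoded = encodeSegment(word);
        System.out.println(encoded);
        System.out.println(word.length() + " chars -> " + encoded.length() + " chars");
    }
}
```

For this word the sketch prints `%D0%BE%D1%82%D0%B4%D0%B5%D0%BB`, the same escapes that open the path in the URL above, and reports 5 characters growing to 30.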

Snippet of Rejected URLs list below:

[quote]
2015/05/29 11:30:08 http://www.nsi.bg/bg/content/8350/%D0%BF%D1%83%D0%B1%D0%BB%D0%B8%D0%BA%D0%B0%D1%86%D0%B8%D1%8F/%D1%81%D1%82%D0%B0%D1%82%D0%B8%D1%81%D1%82%D0%B8%D1%87%D0%B5%D1%81%D0%BA%D0%B8-%D1%81%D0%B1%D0%BE%D1%80%D0%BD%D0%B8%D0%BA-%D0%BD%D0%B0-%D0%BE%D0%B1%D0%BB%D0%B0%D TEMPORARY_NETWORK_FAILURE cannot load: load error - java.io.IOException: Malformed escape pair at index 254: http://www.nsi.bg/bg/content/8350/%D0%BF%D1%83%D0%B1%D0%BB%D0%B8%D0%BA%D0%B0%D1%86%D0%B8%D1%8F/%D1%81%D1%82%D0%B0%D1%82%D0%B8%D1%81%D1%82%D0%B8%D1%87%D0%B5%D1%81%D0%BA%D0%B8-%D1%81%D0%B1%D0%BE%D1%80%D0%BD%D0%B8%D0%BA-%D0%BD%D0%B0-%D0%BE%D0%B1%D0%BB%D0%B0%D
2015/05/29 11:30:08 http://www.nsi.bg/bg/content/924/календарно-изгладени-2010100?qt-statistical_domain=1 TEMPORARY_NETWORK_FAILURE cannot load: load error - java.io.IOException: Client can't execute: No route to host duration=4 for url http://www.nsi.bg/bg/content/924/календарно-изгладени-2010100?qt-statistical_domain=1
2015/05/29 11:30:08 http://www.nsi.bg/bg/content/3128/%D0%BF%D1%80%D0%B5%D1%81%D1%81%D1%8A%D0%BE%D0%B1%D1%89%D0%B5%D0%BD%D0%B8%D0%B5/%D0%B8%D0%BD%D0%B4%D0%B5%D0%BA%D1%81%D0%B8-%D0%BD%D0%B0-%D0%BF%D0%B0%D0%B7%D0%B0%D1%80%D0%BD%D0%B8%D1%82%D0%B5-%D1%86%D0%B5%D0%BD%D0%B8-%D0%BD% TEMPORARY_NETWORK_FAILURE cannot load: load error - java.io.IOException: Malformed escape pair at index 255: http://www.nsi.bg/bg/content/3128/%D0%BF%D1%80%D0%B5%D1%81%D1%81%D1%8A%D0%BE%D0%B1%D1%89%D0%B5%D0%BD%D0%B8%D0%B5/%D0%B8%D0%BD%D0%B4%D0%B5%D0%BA%D1%81%D0%B8-%D0%BD%D0%B0-%D0%BF%D0%B0%D0%B7%D0%B0%D1%80%D0%BD%D0%B8%D1%82%D0%B5-%D1%86%D0%B5%D0%BD%D0%B8-%D0%BD%
[/quote]

(0001065)
sbolokanov (reporter)
2015-05-29 10:52
edited on: 2015-05-29 10:53

By the way, I just did a quick length check of a few broken (malformed) URLs: they are all 256 characters long, and I can find no pattern to the breaking point other than that number.
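
The logged text matches the message format of `java.net.URI`, which reports "Malformed escape pair" when a `%` is not followed by two hex digits. A cut in the middle of a `%XX` escape produces exactly that. Below is a minimal sketch, assuming the URL is simply cut to 256 characters somewhere before it is parsed. The class name, the helper names and the `example.org` URL are made up for illustration and are not YaCy code.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Illustrative only: reproduces the parser error on a URL cut at 256 chars.
public class TruncatedUrlDemo {

    // Assumed limit, taken from the lengths measured above.
    static final int MAX_LEN = 256;

    // Cuts the string blindly, without regard to %XX escape boundaries.
    static String truncate(String url, int maxLen) {
        return url.length() <= maxLen ? url : url.substring(0, maxLen);
    }

    // Returns null if the URL parses, else the parser's reason and index.
    static String parseError(String url) {
        try {
            new URI(url);
            return null;
        } catch (URISyntaxException e) {
            return e.getReason() + " at index " + e.getIndex();
        }
    }

    // Builds a made-up long URL: a 21-character prefix plus repeated escapes.
    static String longUrl(int pairs) {
        StringBuilder sb = new StringBuilder("http://example.org/a/");
        for (int i = 0; i < pairs; i++) {
            sb.append("%D0%B0"); // one Cyrillic letter as two escapes
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String full = longUrl(100);            // 21 + 600 = 621 characters
        String cut = truncate(full, MAX_LEN);  // ends in a lone '%'
        System.out.println(parseError(full));  // the uncut URL parses
        System.out.println(parseError(cut));
    }
}
```

With this prefix the cut lands right after a `%`, so the second line printed is `Malformed escape pair at index 255`, the same reason and index as in the first rejected URL. A cut one character later (ending in `%D`) would report index 254, as in the first entry of the quoted list.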

(0001068)
BuBu (developer)
2015-06-08 23:48

Indeed, the crawler uses a file-based queue which limits the URL length to 256 characters.

The line responsible for the limit is
https://github.com/yacy/yacy_search_server/blob/master/source/net/yacy/crawler/retrieval/Request.java#L51

But changing it has a fatal effect on any existing crawl queue during startup (I'm not comfortable touching it).

(0001086)
sbolokanov (reporter)
2015-07-09 17:44

Just want to report:

I've set up a new YaCy instance and changed the suggested value to 2048. It now seems to work fine.
I guess it will stay this way for now. At least it's working.

(0001088)
BuBu (developer)
2015-08-03 00:54

Fixed in v1.83/9302 with commit
https://github.com/yacy/yacy_search_server/commit/fa08ca207e5aef4a62a345765206e11f44fd2dfc

All running crawls need to be finished (they can be restarted afterwards).

- Issue History
Date Modified Username Field Change
2015-05-12 17:32 sbolokanov New Issue
2015-05-29 10:21 sbolokanov File Added: yacy00.log
2015-05-29 10:46 sbolokanov Note Added: 0001064
2015-05-29 10:52 sbolokanov Note Added: 0001065
2015-05-29 10:53 sbolokanov Note Edited: 0001064
2015-05-29 10:53 sbolokanov Note Edited: 0001065
2015-05-29 10:54 sbolokanov Note Edited: 0001064
2015-05-29 10:55 sbolokanov Note Edited: 0001064
2015-05-29 10:56 sbolokanov Tag Attached: bug
2015-05-29 17:51 sbolokanov Note Edited: 0001064
2015-06-08 23:48 BuBu Note Added: 0001068
2015-06-08 23:55 BuBu Status new => confirmed
2015-07-09 17:44 sbolokanov Note Added: 0001086
2015-08-03 00:54 BuBu Note Added: 0001088
2015-08-03 00:54 BuBu Status confirmed => resolved
2015-08-03 00:54 BuBu Resolution open => fixed
2015-08-03 00:54 BuBu Assigned To => BuBu

