YaCy-Bugtracker

View Issue Details
ID: 0000717
Project: YaCy
Category: [All Projects] General
View Status: public
Date Submitted: 2017-01-05 21:26
Last Update: 2017-01-06 03:04
Reporter: BuBu
Assigned To: BuBu
Priority: normal
Severity: major
Reproducibility: always
Status: resolved
Resolution: fixed
ETA: none
Platform:
OS:
OS Version:
Product Version:
Target Version:
Fixed in Version:
Summary: 0000717: Index document with wrong field content for metadata from HTML tags
Description: After parsing or crawling several documents, some index documents contain wrong content, e.g. in h1_txt and in other text extracted from HTML tags (e.g. underline_txt or the image_* index fields).

Example: the h1_txt value below does not appear anywhere on the page:
<doc>
<str name="id">oqcOSGHd8iIa</str>
<str name="sku">http://worldbuilding.stackexchange.com/questions/66895/ultimate-australian-canal</str>

<arr name="title">
<str>climate - Ultimate Australian Canal - Worldbuilding Stack Exchange</str>
</arr>

<arr name="h1_txt">
<str>Spendenaufruf : Wikipedia sammelt 8,7 Millionen Euro</str>
</arr>
<int name="h1_i">1</int>
Additional Information: Debug Info:
The tag/field content comes from the scraper.
The scraper that was used is remembered in the htmlParser.

However, the parser is reused for several documents, and its remembered scraper is set to that of the current document while the indexing process may still be working on an earlier document.

In this concurrency situation, yacy2solr receives the earlier document but the current scraperObject.
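
The situation described above can be illustrated with a minimal Java sketch. It is not YaCy code: the classes Scraper, ParsedDoc, SharedStateParser and PerDocumentParser, as well as the URLs and headings, are hypothetical names chosen for illustration, and the interleaving is simulated deterministically in a single thread instead of with real crawler and indexer threads. It contrasts a reused parser that remembers only its latest scraper with one possible remedy, letting each document carry its own scraper, which is not necessarily what the actual fix does.

```java
// Hypothetical illustration only; none of these classes exist in YaCy.
final class ScraperRaceSketch {

    /** Stand-in for the per-document scraper result (here only the h1 text). */
    static final class Scraper {
        final String h1;
        Scraper(String h1) { this.h1 = h1; }
    }

    /** Stand-in for a parsed document that carries its own scraper. */
    static final class ParsedDoc {
        final String url;
        final Scraper scraper;
        ParsedDoc(String url, Scraper scraper) {
            this.url = url;
            this.scraper = scraper;
        }
    }

    /** Problematic pattern: the reused parser remembers only the latest scraper. */
    static final class SharedStateParser {
        private Scraper lastScraper;

        String parse(String url, String h1) {
            lastScraper = new Scraper(h1); // overwrites the previous document's scraper
            return url;
        }

        Scraper getScraper() { return lastScraper; }
    }

    /** Alternative pattern: the scraper travels with the document it belongs to. */
    static final class PerDocumentParser {
        ParsedDoc parse(String url, String h1) {
            return new ParsedDoc(url, new Scraper(h1));
        }
    }

    /** Parses A, then B, then "indexes" A using the parser-wide scraper. */
    static String indexedH1WithSharedState() {
        SharedStateParser parser = new SharedStateParser();
        parser.parse("http://example.org/a", "Heading A");
        parser.parse("http://example.org/b", "Heading B"); // parser moves on before A is indexed
        return parser.getScraper().h1; // scraper of B is used for document A
    }

    /** Same interleaving, but the indexer reads the scraper stored on document A. */
    static String indexedH1PerDocument() {
        PerDocumentParser parser = new PerDocumentParser();
        ParsedDoc earlier = parser.parse("http://example.org/a", "Heading A");
        parser.parse("http://example.org/b", "Heading B");
        return earlier.scraper.h1;
    }

    public static void main(String[] args) {
        System.out.println("shared state: " + indexedH1WithSharedState()); // shared state: Heading B
        System.out.println("per document: " + indexedH1PerDocument());     // per document: Heading A
    }
}
```

With the shared field, the earlier document ends up with the heading of the later one, which matches the symptom in the example above (an h1_txt value from an unrelated page).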
Tags: No tags attached.
Attached Files

Relationships

Notes
(0001370)
BuBu (developer)
2017-01-06 03:04

See commit https://github.com/yacy/yacy_search_server/commit/4c9be29a55b51d9937137806ed4f248875c32a2b

Issue History
Date Modified     Username  Field        Change
2017-01-05 21:26  BuBu      New Issue
2017-01-05 21:26  BuBu      Status       new => assigned
2017-01-05 21:26  BuBu      Assigned To  => BuBu
2017-01-06 03:04  BuBu      Note Added: 0001370
2017-01-06 03:04  BuBu      Status       assigned => resolved
2017-01-06 03:04  BuBu      Resolution   open => fixed

