View Issue Details
ID: 0000646
Project: YaCy
Category: [All Projects] General
View Status: public
Date Submitted: 2016-03-22 11:18
Last Update: 2016-09-01 15:54
Assigned To:
Platform:
OS:
OS Version:
Product Version: YaCy 1.8
Target Version:
Fixed in Version:
Summary: 0000646: Refactoring of some postprocessing procedures
Description: Improving some of YaCy's postprocessing procedures would be very helpful, since they currently take a very long time to complete.
Additional Information: Postprocessing Progress
busy:postprocessed 8200 from 100556956 collection documents; 8 ppm; 12189368 minutes remaining
Tags: No tags attached.
Attached Files: postProcessProfiling.png (113,996 bytes) 2016-08-26 13:55
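
To put the progress line in perspective, the reported figures can be cross-checked with simple arithmetic. The following is only a sketch, not YaCy code: the class and method names are made up for illustration, and it assumes a constant processing rate. It derives the rate implied by the reported remaining time (the displayed "8 ppm" appears to be rounded) and converts the remaining minutes into days, which comes out at roughly 23 years.

```java
import java.time.Duration;

/** Back-of-the-envelope check of the progress line; hypothetical helper, not YaCy code. */
class PostprocessingEta {

    /** Minutes left, assuming a constant rate of docsPerMinute. */
    static double minutesRemaining(long total, long done, double docsPerMinute) {
        if (docsPerMinute <= 0) {
            throw new IllegalArgumentException("rate must be positive");
        }
        return (total - done) / docsPerMinute;
    }

    /** Rate (documents per minute) implied by a reported remaining time. */
    static double impliedRate(long total, long done, long minutesRemaining) {
        return (double) (total - done) / minutesRemaining;
    }

    public static void main(String[] args) {
        // Figures copied from the progress line above.
        long total = 100_556_956L;
        long done = 8_200L;
        long reportedMinutes = 12_189_368L;

        double rate = impliedRate(total, done, reportedMinutes);
        long days = Duration.ofMinutes(reportedMinutes).toDays();

        System.out.printf("implied rate: %.2f docs/min%n", rate);
        System.out.printf("remaining: %d days (about %.1f years)%n", days, days / 365.0);
    }
}
```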

- Notes
LA_FORGE (reporter)
2016-03-22 11:29

The system load is very high during the process mentioned above:
load average: 10.54, 10.92, 10.72
These values represent a load of roughly 130 to 140% on my 8-core system (load average divided by the number of cores).
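
For reference, the percentage can be reproduced by dividing each load average by the core count. This is a minimal sketch with hypothetical class and method names; it also reads the current 1-minute load average of the local machine through the standard `OperatingSystemMXBean`, which returns a negative value on platforms where the figure is not available.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.OperatingSystemMXBean;

/** Converts a load average into a per-core percentage; hypothetical helper. */
class LoadPerCore {

    static double loadPercent(double loadAverage, int cores) {
        if (cores <= 0) {
            throw new IllegalArgumentException("cores must be positive");
        }
        return 100.0 * loadAverage / cores;
    }

    public static void main(String[] args) {
        // The three load averages reported above, on an 8-core machine.
        for (double load : new double[] {10.54, 10.92, 10.72}) {
            System.out.printf("%.2f on 8 cores -> %.1f%%%n", load, loadPercent(load, 8));
        }

        // Same calculation for the machine this runs on.
        OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
        double current = os.getSystemLoadAverage(); // negative if unavailable
        if (current >= 0) {
            System.out.printf("local: %.1f%%%n",
                    loadPercent(current, os.getAvailableProcessors()));
        }
    }
}
```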


luc (reporter)
2016-08-26 14:03
edited on: 2016-08-26 14:05

Hi, I attached a first profiling trace, captured with VisualVM on a Debian Jessie machine while postprocessing 11240 recently crawled documents with the webgraph Solr core disabled.

With this configuration, the two main hotspots consuming processor time are clearly identified and, unsurprisingly, both relate to Solr operations:
- in CollectionConfiguration.postprocessing_doublecontent: the internal parsing of the Solr query used to search for duplicate documents (ExtendedDismaxQParser.parse, 41%)
- in CollectionConfiguration.postprocessing: the partial update of each document (ConcurrentUpdateSolrConnector.update(), 34.1%)

LA_FORGE (reporter)
2016-08-28 20:06

Thank you very much for the detailed analysis. I am using Debian Jessie here, too. It runs much more smoothly since I upgraded to YaCy 1.9 and Java 8, with the additional Java args -XX:+UseParallelGC -XX:+UseNUMA in a multiprocessor environment. I'm sure the new Solr version is improved, too.
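
One way to confirm that such arguments actually reached the JVM is to inspect the input arguments of the running process. The sketch below is a standalone check with a hypothetical class name; it says nothing about how YaCy itself passes JVM arguments. It relies only on the standard `RuntimeMXBean.getInputArguments()` call.

```java
import java.lang.management.ManagementFactory;
import java.util.Arrays;
import java.util.List;

/** Reports whether the JVM was started with the flags mentioned above; hypothetical helper. */
class JvmFlagCheck {

    static final List<String> EXPECTED =
            Arrays.asList("-XX:+UseParallelGC", "-XX:+UseNUMA");

    static boolean hasFlag(List<String> jvmArgs, String flag) {
        return jvmArgs.contains(flag);
    }

    public static void main(String[] args) {
        // Arguments passed to the JVM itself, not to the main method.
        List<String> jvmArgs = ManagementFactory.getRuntimeMXBean().getInputArguments();
        for (String flag : EXPECTED) {
            System.out.println(flag + " set: " + hasFlag(jvmArgs, flag));
        }
    }
}
```

After compiling with `javac JvmFlagCheck.java`, running `java -XX:+UseParallelGC -XX:+UseNUMA JvmFlagCheck` should report both flags as set.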
luc (reporter)
2016-08-30 09:44

I continued experimenting a bit to see how much this can be improved.
As a check, I disabled the duplicate document search and changed how updates are performed: I grouped them into batches of 100 and even 1000 documents to avoid committing on each update.
With my small number of test documents (11240), this roughly halves the processing time. But it is no miracle: with this logic and millions of documents to process, the time would still be far too long.
I wonder whether this processing could be performed at another, more appropriate place or time...
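
The grouping idea described in this note can be sketched generically as follows. This is not the code used in the experiment: the class, the method, and the `sink` callback are hypothetical stand-ins, with `sink` representing whatever sends one batch of updates to the index and commits once. It only illustrates how 11240 documents collapse into a much smaller number of commits.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.function.Consumer;

/** Groups items into fixed-size batches; hypothetical helper, not YaCy code. */
class BatchedUpdates {

    /**
     * Passes the items to sink in batches of at most batchSize elements
     * and returns the number of batches delivered.
     */
    static <T> int forEachBatch(Iterable<T> items, int batchSize, Consumer<List<T>> sink) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<T> batch = new ArrayList<>(batchSize);
        int batches = 0;
        for (T item : items) {
            batch.add(item);
            if (batch.size() == batchSize) {
                sink.accept(new ArrayList<>(batch)); // hand over a copy, then reuse the buffer
                batch.clear();
                batches++;
            }
        }
        if (!batch.isEmpty()) { // last, partially filled batch
            sink.accept(new ArrayList<>(batch));
            batches++;
        }
        return batches;
    }

    public static void main(String[] args) {
        List<String> docs = Collections.nCopies(11240, "doc");
        // A real sink would send the batch and commit once; here it does nothing.
        System.out.println("batches of 100:  " + forEachBatch(docs, 100, b -> { }));
        System.out.println("batches of 1000: " + forEachBatch(docs, 1000, b -> { }));
    }
}
```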
luc (reporter)
2016-09-01 15:54

Unfortunately, more extensive profiling revealed that grouping commit updates is ineffective. The real main CPU hotspot is the postprocessing_doublecontent query.
With my current knowledge, I do not know how to improve this drastically...

By the way, for now I have committed some refactorings to at least make the whole postprocessing algorithm easier to understand and to profile, for example with VisualVM.
See Pull Request: https://github.com/yacy/yacy_search_server/pull/71

- Issue History
Date Modified Username Field Change
2016-03-22 11:18 LA_FORGE New Issue
2016-03-22 11:29 LA_FORGE Note Added: 0001230
2016-08-26 13:55 luc File Added: postProcessProfiling.png
2016-08-26 14:03 luc Note Added: 0001287
2016-08-26 14:05 luc Note Edited: 0001287
2016-08-28 20:06 LA_FORGE Note Added: 0001288
2016-08-30 09:44 luc Note Added: 0001291
2016-09-01 15:54 luc Note Added: 0001292
