Hi all,
to keep this thread up to date... ;-)
d) jdbc batch size
changed to 10. (Was default: 500, then 1000)
The problem with my DIH setup is that the root entity query returns a
huge set (all ids that shall be indexed). A larger fetchsize would be
good for that query.
The nested entity, however, returns only up to 9 rows, ever. The
constraints are so strict (by id) that there is no way that any
additional data could be pre-fetched.
(Actually, shouldn't anyone using DIH with nested entities run into that
problem?)
After changing to 10, I cannot see that this low batch size slowed the
indexer down (significantly).
As I would like to stick with DIH (instead of dumping the data into CSV
and importing it afterwards), here is my question:
Do you think it's possible to return (in the nested entity) rows
independent of the unique id, and let the processor decide when a
document is complete?
The examples in the wiki always use an ID to get the data for the nested
entity, so I'm not sure it was planned with that in mind. But as I'm
already handling multiple db rows for one document, it might not be too
difficult to change to handling the unique id correctly, as well?
Of course, I would need something like a look ahead to know whether the
next row is already part of the next document.
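
To make the idea concrete, here is a minimal sketch of that grouping logic in plain Java. It is not DIH code and does not use any DIH API: the class name RowGrouper, the method group, and the "id" key are made up for illustration. It assumes the rows arrive sorted by the unique id (e.g. via ORDER BY), so a change in id marks the start of the next document. The row that reveals the boundary plays the role of the look-ahead row and becomes the first row of the next document.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.Objects;

// Hypothetical helper, not part of Solr/DIH.
public class RowGrouper {

    // Groups consecutive rows that share the same id into one list per document.
    // Assumption: the rows are sorted by idKey, so equal ids are adjacent.
    public static List<List<Map<String, Object>>> group(
            Iterator<Map<String, Object>> rows, String idKey) {
        List<List<Map<String, Object>>> docs = new ArrayList<>();
        List<Map<String, Object>> current = new ArrayList<>();
        Object currentId = null;

        while (rows.hasNext()) {
            Map<String, Object> row = rows.next();
            Object id = row.get(idKey);

            // A different id means the previous document is complete.
            if (!current.isEmpty() && !Objects.equals(id, currentId)) {
                docs.add(current);
                current = new ArrayList<>();
            }
            currentId = id;
            current.add(row);
        }

        // The last document is complete only once the rows are exhausted.
        if (!current.isEmpty()) {
            docs.add(current);
        }
        return docs;
    }
}
```

With something like this, the nested query would run once over all ids instead of once per id, so a larger fetch size would pay off there as well. Whether DIH's entity processors can be extended in this way is exactly the open question.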
Cheers,
Chantal
Concerning the other settings (just fyi):
a) mergeFactor 10 (and also tried 100)
I don't think that changed anything for the worse, rather for the better.
So, I'll stick with 10 from now on.
b) ramBufferSizeMB
tried 512, 1024. RAM usage went up when I increased from 256 to 512. Not
sure about 1024. I'll stick to 512.