Hi Ari,
> On 07.03.2017 at 23:14, Aristedes Maniatis <[email protected]> wrote:
>
> On 7/3/17 8:25am, Musall, Maik wrote:
>> Hi all,
>>
>> I have a number of statistics functions which need to fetch large amounts of
>> objects. I need the actual DataObjects because that's where the business
>> logic is that I need for the computations.
>>
>> Let's say I need to fetch 300.000 objects. Let's also assume the database
>> sits on a fast SSD array and can serve multiple connections easily. I'm
>> assuming in this case the CPU time needed for DataObject instantiation is
>> the main performance constraint. Is that correct?
>>
>> If so, how can I speed this up? Could I partition my fetch, and fetch in
>> several threads in parallel into the same ObjectContext? Or is there an
>> easier way to make use of multiple CPU cores for this?
>
>
> I don't think there is anything in Cayenne that will specifically help you
> here. However if you can partition your search query, then of course you can
> fetch the data in multiple threads in parallel.
>
> You might also want to fetch into DataRows rather than creating object
> entities. I'm not sure if that will make your use case faster, but you could
> try, especially if you don't need all the columns from the db entity.

I tried that already. Results (wall-clock time):

  regular SelectQuery:                          25888 ms for 1291644 objects
  DataRowQuery alone:                           14289 ms for 1291644 rows
  DataRowQuery + sequential instantiation:       6878 ms for 1291644 objects
                                                 (sum with the fetch = 21167 ms)
  DataRowQuery + parallel instantiation:         7351 ms for 1291644 objects
                                                 (sum with the fetch = 21640 ms)
  DataRowQuery with iterator:                   22484 ms for 1291644 objects
  DataRowQuery with batch iterator of 100 each: 21219 ms for 1291644 objects

Sequential vs. parallel instantiation here means stream() vs. parallelStream()
over the fetched DataRows. Which of the two was faster varied randomly from run
to run.

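For anyone who wants to repeat the stream() vs. parallelStream() comparison
without a database, here is a minimal sketch of a timing harness. The class name
InstantiationTiming, its methods, and the string-building stand-in for
instantiation are all hypothetical; in the real test the mapping function would
be the oc.objectFromDataRow(...) call shown further down. A single unwarmed JVM
run is noisy, so the numbers it prints are only a rough indication.

```java
import java.util.List;
import java.util.function.Function;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

// Hypothetical helper, not part of Cayenne: times a per-row mapping step
// run either sequentially or through the common fork/join pool.
final class InstantiationTiming {

    /** Applies the mapping to every row, sequentially or in parallel. */
    static <R, T> List<T> mapAll(List<R> rows, Function<R, T> instantiate, boolean parallel) {
        return (parallel ? rows.parallelStream() : rows.stream())
                .map(instantiate)
                .collect(Collectors.toList()); // keeps encounter order in both modes
    }

    /** Returns the elapsed wall-clock time of one mapAll run in milliseconds. */
    static <R, T> long timeMillis(List<R> rows, Function<R, T> instantiate, boolean parallel) {
        long start = System.nanoTime();
        List<T> result = mapAll(rows, instantiate, parallel);
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        if (result.size() != rows.size()) {
            throw new IllegalStateException("row count changed during mapping");
        }
        return elapsedMs;
    }

    public static void main(String[] args) {
        // Stand-in data: integers instead of DataRows (assumption for the sketch).
        List<Integer> rows = IntStream.range(0, 1_000_000).boxed().collect(Collectors.toList());
        // Stand-in for object instantiation (assumption for the sketch).
        Function<Integer, String> instantiate = i -> "obj-" + i;

        System.out.println("sequential: " + timeMillis(rows, instantiate, false) + " ms");
        System.out.println("parallel:   " + timeMillis(rows, instantiate, true) + " ms");
    }
}
```

Running each variant several times and discarding the first runs should give a
better idea of whether the difference is real or just noise.
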
So, all in all not that much of a difference. The DataRowQuery alone is faster
of course, but once you add the instantiation, it ends up in the same ballpark
as the regular SelectQuery. A bit faster, but probably not worth the additional
coding, or deviating from the regular APIs.

Consistently fastest was the parallel fetch: DataRowQuery with parallel fetch
and instantiation took 19357 ms for 1291644 objects. I partitioned the fetch
into 4 pieces (exprs is a list of 4 expressions), and then did:

    List<PDCMarketingInfo> objects = exprs.parallelStream()
        .flatMap( exp -> {
            SelectQuery<DataRow> dataRowQuery =
                SelectQuery.dataRowQuery( PDCMarketingInfo.class, exp );
            List<DataRow> dataRows = dataRowQuery.select( oc );
            return dataRows.parallelStream().map( row ->
                oc.objectFromDataRow( PDCMarketingInfo.class, row ) );
        } )
        .collect( Collectors.toList() );

I also did this with iterator instead of dataRowQuery.select(), but that was
slower.
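
The message above does not say how the 4 expressions in exprs were built. One
possible way, shown here purely as an assumption, is to split a numeric key
range into contiguous slices and turn each slice into one range condition on
that key. The sketch below covers only the slicing step in plain Java; the class
RangePartitioner and its Range type are hypothetical names, and building the
actual Cayenne expressions from the slices is left out.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Cayenne: splits an inclusive key range
// into contiguous, non-overlapping slices of near-equal size.
final class RangePartitioner {

    /** One inclusive [from, to] slice of the key range. */
    static final class Range {
        final long from;
        final long to;

        Range(long from, long to) {
            this.from = from;
            this.to = to;
        }

        @Override
        public String toString() {
            return "[" + from + ", " + to + "]";
        }
    }

    /** Returns at most 'parts' slices that together cover minId..maxId exactly once. */
    static List<Range> split(long minId, long maxId, int parts) {
        if (parts < 1 || maxId < minId) {
            throw new IllegalArgumentException("need parts >= 1 and minId <= maxId");
        }
        long total = maxId - minId + 1;
        long base = total / parts;   // minimum slice size
        long extra = total % parts;  // the first 'extra' slices get one more key
        List<Range> ranges = new ArrayList<>();
        long start = minId;
        for (int i = 0; i < parts; i++) {
            long size = base + (i < extra ? 1 : 0);
            if (size == 0) {
                break; // fewer keys than requested parts
            }
            ranges.add(new Range(start, start + size - 1));
            start += size;
        }
        return ranges;
    }

    public static void main(String[] args) {
        // Example: 4 slices, matching the 4 partitions used in the test above.
        System.out.println(split(1, 1291644, 4));
    }
}
```

Equal-width key slices only give equal-sized partitions if the keys are spread
evenly; with gaps or skew in the key space, some threads will get much more work
than others.
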

There may be more benefit from parallelization depending on the hardware used.
This was my 2013 MBP with 4 i7 cores.

Maik