Hi Ari,

> On 07.03.2017 at 23:14, Aristedes Maniatis <[email protected]> wrote:
> 
> On 7/3/17 8:25am, Musall, Maik wrote:
>> Hi all,
>> 
>> I have a number of statistics functions which need to fetch large amounts of 
>> objects. I need the actual DataObjects because that's where the business 
>> logic is that I need for the computations.
>> 
>> Let's say I need to fetch 300.000 objects. Let's also assume the database 
>> sits on a fast SSD array and can serve multiple connections easily. I'm 
>> assuming in this case the CPU time needed for DataObject instantiation is 
>> the main performance constraint. Is that correct?
>> 
>> If so, how can I speed this up? Could I partition my fetch, and fetch in 
>> several threads in parallel into the same ObjectContext? Or is there an 
>> easier way to make use of multiple CPU cores for this?
> 
> 
> I don't think there is anything in Cayenne that will specifically help you 
> here. However if you can partition your search query, then of course you can 
> fetch the data in multiple threads in parallel.
> 
> You might also want to fetch into DataRows rather than creating object 
> entities. I'm not sure if that will make your use case faster, but you could 
> try, especially if you don't need all the columns from the db entity.

I tried that already. Results:

regular SelectQuery: 25888 ms for 1291644 objects
DataRowQuery alone: 14289 ms for 1291644 rows
DataRowQuery sequential instantiation: 6878 ms for 1291644 objects, sum = 21167
DataRowQuery parallel instantiation: 7351 ms for 1291644 objects, sum = 21640
DataRowQuery with iterator: 22484 ms for 1291644 objects
DataRowQuery with batch iterator of 100 each: 21219 ms for 1291644 objects

Sequential/parallel refers to stream() vs. parallelStream(). The difference 
between parallel and sequential instantiation varied randomly between runs.
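For anyone who wants to repeat the stream() vs. parallelStream() comparison, here is a minimal timing sketch using only the JDK. It is not the code I measured with: the class InstantiationTiming and its methods are made up for illustration, and Integer and String are stand-ins for DataRow and the DataObject.

```java
import java.util.List;
import java.util.function.Function;
import java.util.stream.Collectors;
import java.util.stream.IntStream;
import java.util.stream.Stream;

class InstantiationTiming {

    // Maps every row to an object, either sequentially or in parallel
    // (parallelStream() runs on the common fork/join pool).
    static <R, T> List<T> instantiate(List<R> rows, Function<R, T> factory, boolean parallel) {
        Stream<R> stream = parallel ? rows.parallelStream() : rows.stream();
        return stream.map(factory).collect(Collectors.toList());
    }

    // Wall-clock time of one instantiation pass, in milliseconds.
    static <R, T> long timeMillis(List<R> rows, Function<R, T> factory, boolean parallel) {
        long start = System.nanoTime();
        instantiate(rows, factory, parallel);
        return (System.nanoTime() - start) / 1_000_000;
    }

    public static void main(String[] args) {
        // Hypothetical stand-ins: Integer plays the DataRow, String the DataObject.
        List<Integer> rows = IntStream.range(0, 1_000_000).boxed().collect(Collectors.toList());
        Function<Integer, String> factory = id -> "item-" + id;

        // Repeat a few times so JIT warm-up does not dominate a single measurement.
        for (int run = 0; run < 5; run++) {
            long sequential = timeMillis(rows, factory, false);
            long parallel = timeMillis(rows, factory, true);
            System.out.println("sequential: " + sequential + " ms, parallel: " + parallel + " ms");
        }
    }
}
```

With a cheap factory like this one, the per-element work may be too small for the parallel version to win reliably, which would be consistent with the numbers above.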

So, all in all not that much of a difference. The DataRowQuery alone is faster 
of course, but once you add the instantiation, it ends up in the same ballpark 
as the regular SelectQuery. A bit faster, but probably not worth the additional 
coding, or deviating from the regular APIs.

Consistently fastest was doing the parallel fetch:

DataRowQuery parallel fetch+instantiation: 19357 ms for 1291644 objects

I partitioned the fetch into 4 pieces (exprs is a list of 4 expressions), and 
then did:

        List<PDCMarketingInfo> objects = exprs.parallelStream()
                .flatMap( exp -> {
                        SelectQuery<DataRow> dataRowQuery = SelectQuery.dataRowQuery( PDCMarketingInfo.class, exp );
                        List<DataRow> dataRows = dataRowQuery.select( oc );
                        return dataRows.parallelStream().map( row -> oc.objectFromDataRow( PDCMarketingInfo.class, row ) );
                } )
                .collect( Collectors.toList() );

I also did this with iterator instead of dataRowQuery.select(), but that was 
slower.
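
As for the partitioning itself, one possible way to get the 4 pieces is to split a numeric key range into contiguous chunks and build one expression per chunk. The sketch below shows only the range arithmetic with the plain JDK. RangePartitioner, Range and the key bounds are hypothetical names and values chosen for illustration, and it assumes the keys are spread fairly evenly, otherwise the pieces will not be balanced.

```java
import java.util.ArrayList;
import java.util.List;

class RangePartitioner {

    // Half-open range [from, to) over a hypothetical numeric key.
    static final class Range {
        final long from;
        final long to;

        Range(long from, long to) {
            this.from = from;
            this.to = to;
        }
    }

    // Splits [min, maxExclusive) into `parts` contiguous ranges whose sizes differ by at most one.
    static List<Range> split(long min, long maxExclusive, int parts) {
        if (parts <= 0 || maxExclusive < min) {
            throw new IllegalArgumentException("need parts > 0 and maxExclusive >= min");
        }
        long total = maxExclusive - min;
        long base = total / parts;
        long remainder = total % parts;

        List<Range> ranges = new ArrayList<>(parts);
        long start = min;
        for (int i = 0; i < parts; i++) {
            // The first `remainder` ranges take one extra key each.
            long size = base + (i < remainder ? 1 : 0);
            ranges.add(new Range(start, start + size));
            start += size;
        }
        return ranges;
    }

    public static void main(String[] args) {
        // Assumed key bounds, for illustration only.
        for (Range r : split(1, 1_291_645, 4)) {
            System.out.println("key >= " + r.from + " and key < " + r.to);
        }
    }
}
```

Each range would then be turned into one expression (key >= from and key < to) and the resulting list used as exprs in the snippet above.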

There may be more benefit from parallelization depending on the hardware used. 
This was my 2013 MBP with 4 i7 cores.

Maik
