Hi, I'm developing a long-running process that should find the RSS feeds all users in the system have registered to follow, parse those feeds, extract new entries, and store them back to the database as Hibernate entities so users can retrieve them. I want to use Apache Spark to enable parallel processing, since this process might take several hours depending on the number of users.
The approach I thought should work was to use *useridsRDD.foreachPartition*, so I can have a separate Hibernate session for each partition. I created a database session manager that is initialized once per partition and keeps the Hibernate session alive until the process is over. Once all RSS feeds from one source are parsed and Feed entities are created, I send the whole list to a DatabaseManager method that saves the list in a batch:

> public <T extends BaseEntity> void saveInBatch(List<T> entities) {
>     try {
>         if (!session.getTransaction().isActive()) {
>             session.beginTransaction();
>         }
>         for (T entity : entities) {
>             session.save(entity);
>         }
>         session.getTransaction().commit();
>     } catch (Exception ex) {
>         if (session.getTransaction() != null && session.getTransaction().isActive()) {
>             session.getTransaction().rollback();
>         }
>         ex.printStackTrace();
>     }
> }

However, this works only if I have one Spark partition. If there are two or more partitions, the whole process blocks as soon as I try to save the first entity. To make things simpler, I tried simplifying the Feed entity so that it neither references nor is referenced by any other entity, and it has no collections either. I hope some of you have already tried something similar and could give me an idea how to solve this problem.

Thanks,
Zoran
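P.S. For completeness, here is the batch-save pattern I'm aiming for, as a minimal, self-contained sketch. The `Session` interface below is a stand-in for `org.hibernate.Session` (only the calls I use), and `BATCH_SIZE` is a value I picked for illustration; the idea is to flush and clear periodically so the first-level cache doesn't grow without bound during a long run:

```java
import java.util.ArrayList;
import java.util.List;

public class BatchSaveSketch {

    // Stand-in for org.hibernate.Session, limited to the calls used below.
    interface Session {
        void save(Object entity);
        void flush();
        void clear();
    }

    // Illustrative chunk size; in real code this would match
    // the hibernate.jdbc.batch_size setting.
    static final int BATCH_SIZE = 50;

    // Saves entities in chunks, flushing and clearing the session every
    // BATCH_SIZE saves so pending inserts are pushed to the database and
    // the persistence context stays small. Returns the number saved.
    static <T> int saveInBatch(Session session, List<T> entities) {
        int saved = 0;
        for (T entity : entities) {
            session.save(entity);
            saved++;
            if (saved % BATCH_SIZE == 0) {
                session.flush();
                session.clear();
            }
        }
        // Flush the final partial chunk.
        session.flush();
        session.clear();
        return saved;
    }

    public static void main(String[] args) {
        // Fake session that just records saved entities, so the
        // chunking logic can be exercised without a database.
        List<Object> stored = new ArrayList<>();
        Session fake = new Session() {
            public void save(Object e) { stored.add(e); }
            public void flush() { }
            public void clear() { }
        };
        List<String> feeds = new ArrayList<>();
        for (int i = 0; i < 120; i++) {
            feeds.add("feed-" + i);
        }
        int n = saveInBatch(fake, feeds);
        System.out.println(n + " entities saved");
    }
}
```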