Hi,

I'm developing a long-running process that finds the RSS feeds all users
in the system have registered to follow, parses those feeds, extracts new
entries, and stores them back to the database as Hibernate entities so
users can retrieve them. I want to use Apache Spark to enable parallel
processing, since this process might take several hours depending on the
number of users.

The approach I thought should work was to use *useridsRDD.foreachPartition*,
so I can have a separate Hibernate session for each partition. I created a
database session manager that is initialized once per partition and keeps
the Hibernate session alive until the partition is fully processed.
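In outline, the wiring looks like this (a simplified sketch;
DatabaseSessionManager and parseFeedsFor stand in for my actual classes):

    useridsRDD.foreachPartition(userIds -> {
        // one session manager, and thus one Hibernate session, per partition
        DatabaseSessionManager manager = new DatabaseSessionManager();
        try {
            while (userIds.hasNext()) {
                Long userId = userIds.next();
                // fetch and parse this user's feeds, building Feed entities
                List<Feed> feeds = parseFeedsFor(userId);
                manager.saveInBatch(feeds);
            }
        } finally {
            // keep the session alive only until the partition is done
            manager.close();
        }
    });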

Once all RSS feeds from one source are parsed and the Feed entities are
created, I send the whole list to a database manager method that saves it
in a batch:

    public <T extends BaseEntity> void saveInBatch(List<T> entities) {
        try {
            if (!session.getTransaction().isActive()) {
                session.beginTransaction();
            }
            for (T entity : entities) {
                session.save(entity);
            }
            session.getTransaction().commit();
        } catch (Exception ex) {
            ex.printStackTrace();
            // roll back only if a transaction is actually in progress
            if (session.getTransaction() != null
                    && session.getTransaction().isActive()) {
                session.getTransaction().rollback();
            }
        }
    }
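(For what it's worth, for larger lists the usual Hibernate batch-insert
idiom adds a periodic flush/clear so the session-level cache doesn't grow
unbounded. Assuming hibernate.jdbc.batch_size is set to a matching value,
the loop would look something like this:)

    for (int i = 0; i < entities.size(); i++) {
        session.save(entities.get(i));
        if (i > 0 && i % 50 == 0) { // 50 = assumed hibernate.jdbc.batch_size
            session.flush(); // push pending inserts to the database
            session.clear(); // detach saved entities to free memory
        }
    }
    session.getTransaction().commit();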
However, this works only if I have a single Spark partition. With two or
more partitions, the whole process blocks as soon as I try to save the
first entity. To make things simpler, I stripped the Feed entity down so
that it neither references nor is referenced by any other entity, and it
has no collections.

I hope some of you have already tried something similar and could give me
an idea how to solve this problem.

Thanks,
Zoran
