Hi,

Thank you, Erick, for your input. I tried creating batches of 1000 objects and indexing them into Solr. The performance is far better than before, but the number of indexed documents shown in the dashboard is lower than the number of documents I actually indexed through SolrJ. My code is as follows:

private static String SOLR_SERVER_URL = "http://localhost:8983/solr/newcore ";
private static String JSON_FILE_PATH = "/home/vineeth/week1_fixed.json";
private static JSONParser parser = new JSONParser();
private static SolrClient solr = new HttpSolrClient(SOLR_SERVER_URL);

public static void main(String[] args) throws IOException, SolrServerException, ParseException {
    File file = new File(JSON_FILE_PATH);
    Scanner scn = new Scanner(file, "UTF-8");
    JSONObject object;
    int i = 0;
    Collection<SolrInputDocument> batch = new ArrayList<SolrInputDocument>();
    while (scn.hasNext()) {
        object = (JSONObject) parser.parse(scn.nextLine());
        SolrInputDocument doc = indexJSON(object);
        batch.add(doc);
        if (i % 1000 == 0) {
            System.out.println("Indexed " + (i + 1) + " objects.");
            solr.add(batch);
            batch = new ArrayList<SolrInputDocument>();
        }
        i++;
    }
    solr.add(batch);
    solr.commit();
    System.out.println("Indexed " + (i + 1) + " objects.");
}

public static SolrInputDocument indexJSON(JSONObject jsonOBJ) throws ParseException, IOException, SolrServerException {
    Collection<SolrInputDocument> batch = new ArrayList<SolrInputDocument>();

    SolrInputDocument mainEvent = new SolrInputDocument();
    mainEvent.addField("id", generateID());
    mainEvent.addField("RawEventMessage", jsonOBJ.get("RawEventMessage"));
    mainEvent.addField("EventUid", jsonOBJ.get("EventUid"));
    mainEvent.addField("EventCollector", jsonOBJ.get("EventCollector"));
    mainEvent.addField("EventMessageType", jsonOBJ.get("EventMessageType"));
    mainEvent.addField("TimeOfEvent", jsonOBJ.get("TimeOfEvent"));
    mainEvent.addField("TimeOfEventUTC", jsonOBJ.get("TimeOfEventUTC"));

    Object obj = parser.parse(jsonOBJ.get("User").toString());
    JSONObject userObj = (JSONObject) obj;

    SolrInputDocument childUserEvent = new SolrInputDocument();
    childUserEvent.addField("id", generateID());
    childUserEvent.addField("User", userObj.get("User"));

    obj = parser.parse(jsonOBJ.get("EventDescription").toString());
    JSONObject eventdescriptionObj = (JSONObject) obj;

    SolrInputDocument childEventDescEvent = new SolrInputDocument();
    childEventDescEvent.addField("id", generateID());
    childEventDescEvent.addField("EventApplicationName", eventdescriptionObj.get("EventApplicationName"));
    childEventDescEvent.addField("Query", eventdescriptionObj.get("Query"));

    obj = JSONValue.parse(eventdescriptionObj.get("Information").toString());
    JSONArray informationArray = (JSONArray) obj;

    for (int i = 0; i < informationArray.size(); i++) {
        JSONObject domain = (JSONObject) informationArray.get(i);

        SolrInputDocument domainDoc = new SolrInputDocument();
        domainDoc.addField("id", generateID());
        domainDoc.addField("domainName", domain.get("domainName"));

        String s = domain.get("columns").toString();
        obj = JSONValue.parse(s);
        JSONArray ColumnsArray = (JSONArray) obj;

        SolrInputDocument columnsDoc = new SolrInputDocument();
        columnsDoc.addField("id", generateID());

        for (int j = 0; j < ColumnsArray.size(); j++) {
            JSONObject ColumnsObj = (JSONObject) ColumnsArray.get(j);
            SolrInputDocument columnDoc = new SolrInputDocument();
            columnDoc.addField("id", generateID());
            columnDoc.addField("movieName", ColumnsObj.get("movieName"));
            columnsDoc.addChildDocument(columnDoc);
        }
        domainDoc.addChildDocument(columnsDoc);
        childEventDescEvent.addChildDocument(domainDoc);
    }

    mainEvent.addChildDocument(childEventDescEvent);
    mainEvent.addChildDocument(childUserEvent);
    return mainEvent;
}

I would be grateful if you could let me know what I am missing.

On Sun, Jul 19, 2015 at 2:16 PM, Erick Erickson <erickerick...@gmail.com> wrote:

> First thing is it looks like you're only sending one document at a
> time, perhaps with child objects. This is not optimal at all. I
> usually batch my docs up in groups of 1,000, and there is anecdotal
> evidence that there may (depending on the docs) be some gains above
> that number. Gotta balance the batch size off against how big the docs
> are of course.
>
> Assuming that you really are calling this method for one doc (and
> children) at a time, the far bigger problem other than calling
> server.add for each parent/children is that you're then calling
> solr.commit() every time. This is an anti-pattern. Generally, let the
> autoCommit setting in solrconfig.xml handle the intermediate commits
> while the indexing program is running and only issue a commit at the
> very end of the job if at all.
>
> Best,
> Erick
>
> On Sun, Jul 19, 2015 at 12:08 PM, Vineeth Dasaraju
> <vineeth.ii...@gmail.com> wrote:
> > Hi,
> >
> > I am trying to index JSON objects (which contain nested JSON objects and
> > arrays) into Solr.
> >
> > My JSON object looks like the following (this is fake data that I am
> > using for this example):
> >
> > {
> >   "RawEventMessage": "Lorem ipsum dolor sit amet, consectetur adipiscing elit. Aliquam dolor orci, placerat ac pretium a, tincidunt consectetur mauris. Etiam sollicitudin sapien id odio tempus, non sodales odio iaculis. Donec fringilla diam at placerat interdum. Proin vitae arcu non augue facilisis auctor id non neque. Integer non nibh sit amet justo facilisis semper a vel ligula. Pellentesque commodo vulputate consequat.",
> >   "EventUid": "1279706565",
> >   "TimeOfEvent": "2015-05-01-08-07-13",
> >   "TimeOfEventUTC": "2015-05-01-01-07-13",
> >   "EventCollector": "kafka",
> >   "EventMessageType": "kafka-@column",
> >   "User": {
> >     "User": "Lorem ipsum",
> >     "UserGroup": "Manager",
> >     "Location": "consectetur adipiscing",
> >     "Department": "Legal"
> >   },
> >   "EventDescription": {
> >     "EventApplicationName": "",
> >     "Query": "SELECT * FROM MOVIES",
> >     "Information": [
> >       {
> >         "domainName": "English",
> >         "columns": [
> >           {
> >             "movieName": "Casablanca",
> >             "duration": "154"
> >           },
> >           {
> >             "movieName": "Die Hard",
> >             "duration": "127"
> >           }
> >         ]
> >       },
> >       {
> >         "domainName": "Hindi",
> >         "columns": [
> >           {
> >             "movieName": "DDLJ",
> >             "duration": "176"
> >           }
> >         ]
> >       }
> >     ]
> >   }
> > }
> >
> > My function for indexing the object is as follows:
> >
> > public static void indexJSON(JSONObject jsonOBJ) throws ParseException,
> >         IOException, SolrServerException {
> >     Collection<SolrInputDocument> batch = new ArrayList<SolrInputDocument>();
> >
> >     SolrInputDocument mainEvent = new SolrInputDocument();
> >     mainEvent.addField("id", generateID());
> >     mainEvent.addField("RawEventMessage", jsonOBJ.get("RawEventMessage"));
> >     mainEvent.addField("EventUid", jsonOBJ.get("EventUid"));
> >     mainEvent.addField("EventCollector", jsonOBJ.get("EventCollector"));
> >     mainEvent.addField("EventMessageType", jsonOBJ.get("EventMessageType"));
> >     mainEvent.addField("TimeOfEvent", jsonOBJ.get("TimeOfEvent"));
> >     mainEvent.addField("TimeOfEventUTC", jsonOBJ.get("TimeOfEventUTC"));
> >
> >     Object obj = parser.parse(jsonOBJ.get("User").toString());
> >     JSONObject userObj = (JSONObject) obj;
> >
> >     SolrInputDocument childUserEvent = new SolrInputDocument();
> >     childUserEvent.addField("id", generateID());
> >     childUserEvent.addField("User", userObj.get("User"));
> >
> >     obj = parser.parse(jsonOBJ.get("EventDescription").toString());
> >     JSONObject eventdescriptionObj = (JSONObject) obj;
> >
> >     SolrInputDocument childEventDescEvent = new SolrInputDocument();
> >     childEventDescEvent.addField("id", generateID());
> >     childEventDescEvent.addField("EventApplicationName",
> >             eventdescriptionObj.get("EventApplicationName"));
> >     childEventDescEvent.addField("Query", eventdescriptionObj.get("Query"));
> >
> >     obj = JSONValue.parse(eventdescriptionObj.get("Information").toString());
> >     JSONArray informationArray = (JSONArray) obj;
> >
> >     for (int i = 0; i < informationArray.size(); i++) {
> >         JSONObject domain = (JSONObject) informationArray.get(i);
> >
> >         SolrInputDocument domainDoc = new SolrInputDocument();
> >         domainDoc.addField("id", generateID());
> >         domainDoc.addField("domainName", domain.get("domainName"));
> >
> >         String s = domain.get("columns").toString();
> >         obj = JSONValue.parse(s);
> >         JSONArray ColumnsArray = (JSONArray) obj;
> >
> >         SolrInputDocument columnsDoc = new SolrInputDocument();
> >         columnsDoc.addField("id", generateID());
> >
> >         for (int j = 0; j < ColumnsArray.size(); j++) {
> >             JSONObject ColumnsObj = (JSONObject) ColumnsArray.get(j);
> >             SolrInputDocument columnDoc = new SolrInputDocument();
> >             columnDoc.addField("id", generateID());
> >             columnDoc.addField("movieName", ColumnsObj.get("movieName"));
> >             columnsDoc.addChildDocument(columnDoc);
> >         }
> >         domainDoc.addChildDocument(columnsDoc);
> >         childEventDescEvent.addChildDocument(domainDoc);
> >     }
> >
> >     mainEvent.addChildDocument(childEventDescEvent);
> >     mainEvent.addChildDocument(childUserEvent);
> >     batch.add(mainEvent);
> >     solr.add(batch);
> >     solr.commit();
> > }
> >
> > When I try to index using the above code, I am able to index only 12
> > objects per second. Is there a faster way to do the indexing? I believe I
> > am using the json-fast parser, which is one of the fastest parsers for
> > JSON.
> >
> > Your help will be very valuable to me.
> >
> > Thanks,
> > Vineeth
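
For reference, here is a minimal sketch of the batching pattern discussed in this thread (groups of 1,000, one commit at the end), written so that it runs without a Solr server. The class name `BatchSketch`, the method `sendInBatches`, and the `sink` callback are hypothetical names introduced only for this illustration; the sink stands in for a call such as `solr.add(batch)`. The sketch flushes on the size of the batch rather than on `i % 1000 == 0`: in the newer code at the top of this message that check fires on the very first document (so the first batch holds a single document), and the final message prints `i + 1` after `i` has already been incremented, so the reported total is one higher than the number of parent documents sent. That may or may not be related to the difference seen in the dashboard.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/** Hypothetical helper: hands items to a sink in fixed-size batches. */
public class BatchSketch {

    /**
     * Sends items to the sink in batches of at most batchSize and returns
     * the number of items actually sent. The sink is an assumption standing
     * in for something like solr.add(batch).
     */
    public static <T> int sendInBatches(Iterable<T> items, int batchSize,
                                        Consumer<List<T>> sink) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<T> batch = new ArrayList<>(batchSize);
        int sent = 0;
        for (T item : items) {
            batch.add(item);
            // Flush on the batch's own size, so every full batch has
            // exactly batchSize items.
            if (batch.size() == batchSize) {
                sink.accept(batch);
                sent += batch.size();
                batch = new ArrayList<>(batchSize);
            }
        }
        // Send the remainder only if there is one.
        if (!batch.isEmpty()) {
            sink.accept(batch);
            sent += batch.size();
        }
        // In a real indexer, a single commit (or autoCommit) would follow here.
        return sent;
    }

    public static void main(String[] args) {
        List<Integer> docs = new ArrayList<>();
        for (int i = 0; i < 2500; i++) {
            docs.add(i);
        }
        List<Integer> batchSizes = new ArrayList<>();
        int sent = sendInBatches(docs, 1000, b -> batchSizes.add(b.size()));
        // prints "2500 items in batches [1000, 1000, 500]"
        System.out.println(sent + " items in batches " + batchSizes);
    }
}
```

Counting what was handed to the sink, rather than loop iterations, gives a number that can be compared directly against the parent-document count on the server side.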