> I am trying to import some Common Crawl dataset files into Nutch.
> Those files are in Arc file format.
> I tried using ArcSegmentCreator tool, but that didn't work well.
>
I think Common Crawl uses a slightly different definition of ARCs, not sure though. Anyway, they have released a library to read/write their format (https://github.com/commoncrawl/commoncrawl), which I have tried to use with Behemoth (https://github.com/jnioche/behemoth-commoncrawl), but without much luck so far.

> It was using up all the heap space. Increasing heap space limit didn't
> help.
>
> Does anyone have any thoughts on this?
> Is there a better way to import Common Crawl files?
>

See above. It depends on what you need to do. Behemoth has a Tika module ready, so if you want to parse the dataset this would be a good option.

> Why does ArcSegmentCreator have issues?
>

Strange question; it sounds as if bugs were written for a purpose. Anyway, there might be a bug, but again, I am not sure the ARCs generated by CC are in the same format.

Julien

--
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble

