On Mon, Sep 26, 2011 at 3:51 PM, Marcel Bruch <[email protected]> wrote:
> Thanks Stefan. I gave it a try. Could you or someone else comment on
> the code and its performance?
>
> I wrote a fairly ad hoc dump of the 5900 data files into Jackrabbit.
> Storing ~240 MB took roughly 3 minutes. Is this the expected time for
> such an operation? Is it possible to improve the performance somehow?
the performance seems rather poor. it's hard to tell what's wrong
without having the test data. i noticed that you're storing the
content of the .json files as string properties. why aren't you
storing the json data as nodes & properties?
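e.g. something along these lines (an untested sketch; assumes gson on
the classpath, and JsonImportSketch/importJson are made-up names):

import com.google.gson.JsonElement;
import com.google.gson.JsonObject;

import javax.jcr.Node;

import java.util.Map;

public class JsonImportSketch {

    // recursively maps a parsed json object onto jcr nodes & properties
    static void importJson(JsonObject json, Node node) throws Exception {
        for (Map.Entry<String, JsonElement> entry : json.entrySet()) {
            JsonElement value = entry.getValue();
            if (value.isJsonObject()) {
                // complex json value -> child node
                importJson(value.getAsJsonObject(),
                        node.addNode(entry.getKey(), "nt:unstructured"));
            } else if (value.isJsonPrimitive()) {
                // scalar json value -> string property
                // (could be refined to boolean/long/double properties)
                node.setProperty(entry.getKey(), value.getAsString());
            }
            // json arrays could map to multi-valued properties,
            // omitted here for brevity
        }
    }
}

that way the individual values become queryable instead of being
opaque string blobs.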
anyway, i quickly ran an adapted ad hoc test on my machine
(macbook pro 2.66 ghz, standard hard disk). the test imports
an 'svn export' of jackrabbit/trunk.
importing ~6500 files takes ~30s, which is IMO decent.
cheers
stefan
/////////////////////////////////////////////////////////////////////////////////////////////////////////
import org.apache.commons.io.FileUtils;
import org.apache.jackrabbit.core.TransientRepository;
import javax.jcr.Node;
import javax.jcr.Session;
import javax.jcr.SimpleCredentials;
import java.io.File;
import java.io.FileInputStream;
import java.util.Calendar;
public class JcrArtifactStoreTest {

    static final String FILE_ROOT = "/Users/stefan/tmp/jackrabbit-src/";
    static final boolean STORE_BINARY = false;

    static int count = 0;
    static long size = 0;
    static long ts = 0;

    public static void main(String[] args) throws Exception {
        TransientRepository repository = new TransientRepository();
        Session session = repository.login(
                new SimpleCredentials("admin", "admin".toCharArray()));

        ts = System.currentTimeMillis();
        long ts0 = ts;

        importNode(new File(FILE_ROOT), session.getRootNode());
        session.save();

        long ts1 = System.currentTimeMillis();
        System.out.printf("%d ms: %d units persisted. data %s\n",
                ts1 - ts, count, FileUtils.byteCountToDisplaySize(size));
        ts = ts1;
        System.out.println("Total time: " + (ts1 - ts0) + " ms");
    }

    static void importNode(File file, Node parent) throws Exception {
        if (file.isDirectory()) {
            Node newNode = parent.addNode(file.getName(), "nt:folder");
            File[] children = file.listFiles();
            if (children != null) {
                for (int i = 0; i < children.length; i++) {
                    importNode(children[i], newNode);
                }
            }
        } else {
            Node newNode = parent.addNode(file.getName(), "nt:file");
            String nt = STORE_BINARY ? "nt:resource" : "nt:unstructured";
            Node content = newNode.addNode("jcr:content", nt);
            if (STORE_BINARY) {
                content.setProperty("jcr:data", new FileInputStream(file));
            } else {
                content.setProperty("jcr:data",
                        FileUtils.readFileToString(file));
            }
            content.setProperty("jcr:lastModified", Calendar.getInstance());
            content.setProperty("jcr:mimeType", "application/octet-stream");
            size += file.length();
            // save in batches of 500 to keep the transient space small
            count++;
            if (count % 500 == 0) {
                parent.getSession().save();
                long ts1 = System.currentTimeMillis();
                System.out.printf("%d ms: %d units persisted. data %s\n",
                        ts1 - ts, count,
                        FileUtils.byteCountToDisplaySize(size));
                ts = ts1;
            }
        }
    }
}
>
> The code I used to persist data is given below. The pure IO time w/o
> jackrabbit is ~1 second w/ a solid state disk.
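>
> (an illustrative sketch of such a baseline measurement, reusing
> findDataFiles() from the test below - not necessarily the exact code
> used:)
>
>     long t0 = System.currentTimeMillis();
>     long bytes = 0;
>     for (final Iterator<File> it = findDataFiles(); it.hasNext();) {
>         // raw read only, no repository involved
>         bytes += FileUtils.readFileToString(it.next()).length();
>     }
>     System.out.printf("raw IO: %d ms for %s%n",
>             System.currentTimeMillis() - t0,
>             FileUtils.byteCountToDisplaySize(bytes));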
>
> Thanks for your comments,
> Marcel
>
> Mon Sep 26 15:39:05 CEST 2011: 200 units persisted. data 5 MB
> Mon Sep 26 15:39:11 CEST 2011: 400 units persisted. data 13 MB
> Mon Sep 26 15:39:21 CEST 2011: 600 units persisted. data 21 MB
> Mon Sep 26 15:39:31 CEST 2011: 800 units persisted. data 28 MB
> Mon Sep 26 15:39:35 CEST 2011: 1000 units persisted. data 33 MB
> Mon Sep 26 15:39:40 CEST 2011: 1200 units persisted. data 42 MB
> Mon Sep 26 15:39:44 CEST 2011: 1400 units persisted. data 49 MB
> Mon Sep 26 15:39:50 CEST 2011: 1600 units persisted. data 57 MB
> Mon Sep 26 15:39:54 CEST 2011: 1800 units persisted. data 65 MB
> Mon Sep 26 15:39:58 CEST 2011: 2000 units persisted. data 72 MB
> Mon Sep 26 15:40:10 CEST 2011: 2200 units persisted. data 88 MB
> Mon Sep 26 15:40:15 CEST 2011: 2400 units persisted. data 94 MB
> Mon Sep 26 15:40:22 CEST 2011: 2600 units persisted. data 102 MB
> Mon Sep 26 15:40:26 CEST 2011: 2800 units persisted. data 107 MB
> Mon Sep 26 15:40:30 CEST 2011: 3000 units persisted. data 113 MB
> Mon Sep 26 15:40:36 CEST 2011: 3200 units persisted. data 123 MB
> Mon Sep 26 15:40:40 CEST 2011: 3400 units persisted. data 129 MB
> Mon Sep 26 15:40:45 CEST 2011: 3600 units persisted. data 136 MB
> Mon Sep 26 15:40:48 CEST 2011: 3800 units persisted. data 140 MB
> Mon Sep 26 15:40:58 CEST 2011: 4000 units persisted. data 143 MB
> Mon Sep 26 15:41:18 CEST 2011: 4200 units persisted. data 154 MB
> Mon Sep 26 15:41:24 CEST 2011: 4400 units persisted. data 164 MB
> Mon Sep 26 15:41:38 CEST 2011: 4600 units persisted. data 185 MB
> Mon Sep 26 15:41:43 CEST 2011: 4800 units persisted. data 193 MB
> Mon Sep 26 15:41:50 CEST 2011: 5000 units persisted. data 204 MB
> Mon Sep 26 15:41:56 CEST 2011: 5200 units persisted. data 211 MB
> Mon Sep 26 15:42:00 CEST 2011: 5400 units persisted. data 218 MB
> Mon Sep 26 15:42:05 CEST 2011: 5600 units persisted. data 226 MB
> Mon Sep 26 15:42:10 CEST 2011: 5800 units persisted. data 235 MB
> Mon Sep 26 15:42:15 CEST 2011: 5927 units persisted
>
>
> import java.io.File;
> import java.io.IOException;
> import java.util.Date;
> import java.util.Iterator;
>
> import javax.jcr.Node;
> import javax.jcr.RepositoryException;
> import javax.jcr.Session;
> import javax.jcr.SimpleCredentials;
>
> import org.apache.commons.io.FileUtils;
> import org.apache.commons.io.filefilter.FileFilterUtils;
> import org.apache.commons.io.filefilter.TrueFileFilter;
> import org.apache.jackrabbit.commons.JcrUtils;
> import org.apache.jackrabbit.core.TransientRepository;
> import org.apache.jackrabbit.core.config.ConfigurationException;
> import org.eclipse.core.runtime.Path;
> import org.junit.Before;
> import org.junit.Test;
>
> import com.google.common.base.Charsets;
> import com.google.common.io.Files;
>
> public class JcrArtifactStoreTest {
>
> private TransientRepository repository;
> private Session session;
>
> @Before
> public void setup() throws RepositoryException {
>
> final File basedir = new File("recommenders/").getAbsoluteFile();
> basedir.mkdir();
> repository = new TransientRepository(basedir);
> session = repository.login(new SimpleCredentials("username",
> "password".toCharArray()));
> }
>
> @Test
> public void test2() throws ConfigurationException,
> RepositoryException, IOException {
>
> int i = 0;
> int size = 0;
> final Iterator<File> it = findDataFiles();
> final Node rootNode = session.getRootNode();
>
> while (it.hasNext()) {
> final File file = it.next();
> Node activeNode = rootNode;
> for (final String segment : new
> Path(file.getAbsolutePath()).segments()) {
> activeNode = JcrUtils.getOrAddNode(activeNode, segment);
> }
> // System.out.println(activeNode.getPath());
> final String content = Files.toString(file, Charsets.UTF_8);
> size += content.getBytes().length;
> activeNode.setProperty("cu", content);
> if (++i % 200 == 0) {
> session.save();
>                 System.out.printf("%s: %d units persisted. data %s\n",
>                         new Date(), i,
>                         FileUtils.byteCountToDisplaySize(size));
> }
> }
> session.save();
> System.out.printf("%s: %d units persisted\n", new Date(), i);
> }
>
> @SuppressWarnings("unchecked")
> private Iterator<File> findDataFiles() {
> return FileUtils.iterateFiles(new
> File("/Users/Marcel/Repositories/frankenberger-android-example-apps/"),
> FileFilterUtils.suffixFileFilter(".json"),
> TrueFileFilter.TRUE);
> }
>
>
>
>
> 2011/9/26 Stefan Guggisberg <[email protected]>:
>> hi marcel,
>>
>> On Sun, Sep 25, 2011 at 3:40 PM, Marcel Bruch <[email protected]> wrote:
>>> Hi,
>>>
>>> I'm looking for some advice on whether Jackrabbit might be a good choice
>>> for my problem. Any comments on this are greatly appreciated.
>>>
>>>
>>> = Short description of the challenge =
>>>
>>> We've built an Eclipse-based tool that analyzes Java source files and
>>> stores its analysis results in additional files. The workspace
>>> potentially has hundreds of projects and each project may have up to a
>>> few thousand files. Say there will be 200 projects and 1000 Java source
>>> files per project in a single workspace. Then there will be 200*1000 =
>>> 200,000 files.
>>>
>>> On a full workspace build, all these 200k files have to be compiled (by
>>> the IDE) and analyzed (by our tool) at once, and the analysis results
>>> have to be dumped to disk rather quickly.
>>> But the most common use case is that a single file is changed several times
>>> per minute and thus gets frequently analyzed.
>>>
>>> At the moment, the analysis results are dumped to disk as plain JSON
>>> files; one JSON file for each Java class. Each JSON file is around 5 to
>>> 100 KB in size; some files grow to several megabytes (<10 MB). These
>>> files have a few hundred complex JSON nodes (which might map perfectly
>>> to nodes in JCR).
>>>
>>> = Question =
>>>
>>> We would like to replace the simple file system approach with a more
>>> sophisticated one, and I wonder whether Jackrabbit may be a suitable
>>> backend for this use case. Since we already map all our data to JSON,
>>> Jackrabbit/JCR looks like a perfect fit, but I can't say for sure.
>>>
>>> What's your suggestion? Is Jackrabbit capable of quickly loading and
>>> storing JSON-like data - even if 200k files (nodes + their sub-nodes)
>>> have to be updated in a very short time?
>>
>> absolutely. if the data is reasonably structured/organized jackrabbit
>> should be a perfect fit.
>> i suggest leveraging the java package space hierarchy for organizing
>> the data
>> (i.e. org.apache.jackrabbit.core.TransientRepository ->
>> /org/apache/jackrabbit/core/TransientRepository).
>> for further data modeling recommendations see [0].
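>>
>> e.g. mapping a fully qualified class name to such a path could look
>> roughly like this (a quick untested sketch; nodeForClass is a made-up
>> name, JcrUtils.getOrAddNode is from jackrabbit-jcr-commons):
>>
>> import javax.jcr.Node;
>> import javax.jcr.Session;
>>
>> import org.apache.jackrabbit.commons.JcrUtils;
>>
>> public class PackagePathSketch {
>>
>>     // "org.apache.jackrabbit.core.TransientRepository"
>>     //     -> /org/apache/jackrabbit/core/TransientRepository
>>     static Node nodeForClass(Session session, String fqcn)
>>             throws Exception {
>>         Node node = session.getRootNode();
>>         for (String segment : fqcn.split("\\.")) {
>>             node = JcrUtils.getOrAddNode(node, segment);
>>         }
>>         return node;
>>     }
>> }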
>>
>> cheers
>> stefan
>>
>> [0] http://wiki.apache.org/jackrabbit/DavidsModel
>>
>>>
>>>
>>> Thanks for your suggestions. If you need more details on what
>>> operations are performed or what the data looks like, I would be glad
>>> to take your questions.
>>>
>>> Marcel
>>>
>